Abstract



An ongoing challenge in the analysis of document collections is how to summarize content 
in terms of a set of inferred themes that can be interpreted substantively in terms of topics. 
However, the current practice of parameterizing the themes in terms of most frequent words 
limits interpretability by ignoring the differential use of words across topics. We argue that 
words that are both common and exclusive to a theme are more effective at characterizing top- 
ical content. We consider a setting where professional editors have annotated documents to a 
collection of topic categories, organized into a tree, in which leaf-nodes correspond to the most 
specific topics. Each document is annotated to multiple categories, possibly at different levels 
of the tree. We introduce Hierarchical Poisson Convolution (HPC) as a model to analyze anno- 
tated documents in this setting. The model leverages the structure among categories defined by 
professional editors to infer a clear semantic description for each topic in terms of words that 
are both frequent and exclusive. We develop a parallelized Hamiltonian Monte Carlo sampler 
that allows the inference to scale to millions of documents. 



Keywords: High-dimensional Data; Categorical Data; Hamiltonian Monte Carlo; Parallel In- 
ference; Text Analysis 
1 Introduction



A recurrent challenge in multivariate statistics is how to construct interpretable low-dimensional 
summaries of high-dimensional data. Historically, simple models based on correlation matrices, 
such as principal component analysis (Jolliffe, 1986) and canonical correlation analysis (Hotelling, 
1936), have proven to be effective tools for data reduction. More recently, multilevel models have 
become a flexible and powerful tool for finding latent structure in high dimensional data (McLach- 
lan and Peel, 2000; Sohn and Xing, 2009; Blei et al., 2003b; Airoldi et al., 2008). However, 
while interpretable statistical summaries are highly valued in applications, dimensionality reduc- 
tion models are rarely optimized to aid qualitative discovery; there is no guarantee that the optimal 
low-dimensional projections will be understandable in terms of quantities of scientific interest that 
can help practitioners make decisions. Instead, we design a model with scientific estimands of 
interest in mind to achieve an optimal balance of interpretability and dimensionality reduction. 

We consider a setting in which we observe two sets of categorical data for each unit of
observation: $w_{1:V}$, which live in a high-dimensional space, and $l_{1:K}$, which live in a
structured low-dimensional space and provide a direct link to information of scientific interest
about the sampling units. The goal of the analysis is twofold. First, we desire to develop a joint
model for the observations $Y = \{W_{D \times V}, L_{D \times K}\}$ that can be used to project the
data onto a low-dimensional parameter space $\Theta$ in which interpretability is maintained by
mapping categories in $L$ to directions in $\Theta$. Second, we would like the mapping from the
original space to the low-dimensional projection to be scientifically interesting so that
statistical insights about $\Theta$ can be understood in terms of the original inputs,
$w_{1:V}$, in a way that guides future research.

In the application to text analysis that motivates this work, $w_{1:V}$ are the raw word counts
observed in each document and $l_{1:K}$ are a set of labels created by professional editors that
are indicative of topical content. Specifically, the words are represented as an unordered vector
of counts, with the length of the vector corresponding to the size of a known dictionary. The
labels are organized in a tree-structured ontology, from the most generic topic at the root of the
tree to the most specific topic at the leaves. Each news article may be annotated with more than
one label, at the editors' discretion. The number of labels is given by the size of the ontology
and typically ranges from tens to hundreds of categories. In this context, the inferential
challenge is to discover a low-dimensional representation of topical content, $\Theta$, that
aligns with the coarse labels provided by editors while at the same time providing a mapping
between the textual content and directions in $\Theta$ in a way that formalizes and enhances our
understanding of how low-dimensional structure is expressed in the space of observed words.

Recent work on this problem in the machine learning literature has taken a Bayesian
hierarchical approach, viewing a document's content as arising from a mixture of
component distributions, commonly referred to as "topics" as they often capture thematic structure
(Blei, 2012). As the component distributions are almost exclusively parameterized as multinomial 
distributions over words in the vocabulary, the loading of words onto topics is characterized in 
terms of the relative frequency of within-component usage. While relative frequency has proven to 
be a useful mapping of topical content onto words, recent work has documented a growing list of 
interpretability issues with frequency-based summaries: they are often dominated by contentless 
"stop" words (Wallach et al., 2009), sometimes appear incoherent or redundant (Mimno et al., 
2011; Chang et al., 2009; Airoldi et al., 2010), and typically require post hoc modification to 
meet human expectations (Hu et al., 2011; Grimmer and King, 2011). Instead, we propose a new 
mapping for topical content that incorporates how words are used differentially across topics. If 
a word is common in a topic, it is also important to know whether it is common in many topics 
or relatively exclusive to the topic in question. Both of these summary statistics are informative: 
nonexclusive words are less likely to carry topic-specific content, while infrequent words occur 
too rarely to form the semantic core of a topic. We therefore look for the most frequent words 
in the corpus that are also likely to have been generated from the topic of interest to summarize 
its content. In this approach we borrow ideas from the statistical literature, in which models of 
differential word usage have been leveraged for analyzing writing styles in a supervised setting 
(Mosteller and Wallace, 1984; Airoldi et al., 2005, 2006, 2007; Monroe et al., 2008), and combine 
them with ideas from the machine learning literature, in which latent variable and mixture models 
based on frequent word usage have been used to infer structure that often captures topical content 
(McCallum et al., 1998; Blei et al., 2003b; Canny, 2004). 

From a statistical perspective, models based on topic-specific distributions over the vocabulary 
cannot produce stable estimates of differential usage since they only model the relative frequency 
of words within topics. They cannot regularize usage across topics and naively infer the greatest 
differential usage for the rarest features (Eisenstein et al., 2011). To tackle this issue, we introduce 
the generative framework of Hierarchical Poisson Convolution (HPC) that parameterizes topic- 
specific word counts as unnormalized count variates whose rates can be regularized across topics 
as well as within them, making stable inference of both word frequency and exclusivity possible. 
HPC can be seen as a fully generative extension of Sparse Topic Coding (Zhu and Xing, 2011) that 
emphasizes regularization and interpretability rather than exact sparsity. Additionally, HPC lever- 
ages hierarchical systems of topic categories created by professional editors in collections such as 
Reuters, New York Times, Wikipedia, and Encyclopedia Britannica to make focused comparisons 
of differential use between neighboring topics on the tree and build a sophisticated joint model for 
topic memberships and labels in the documents. By conditioning on a known hierarchy, we avoid 
the complicated task of inferring hierarchical structure (Blei et al., 2003a; Mimno et al., 2007; 
Adams et al., 2010). We introduce a parallelized Hamiltonian Monte Carlo (HMC) estimation 
strategy that makes full Bayesian inference efficient and scalable. 

The proposed model is designed to infer an interpretable description of human-generated
labels; thus we restrict the topic components to have a one-to-one correspondence with the
human-generated labels, as in Labeled LDA (Ramage et al., 2009). This descriptive link between the 
labels and topics differs from the predictive link used in Supervised LDA (Blei and McAuliffe, 
2007; Perotte et al., 2012), where topics are learned as an optimal covariate space to predict an ob- 
served document label or response variable. The more restrictive descriptive link can be expected 
to limit predictive power, but is crucial for learning summaries of individual labels. We then infer 
a description of these labels in terms of words that are both frequent and exclusive. We antici- 
pate that learning a concise semantic description for any collection of topics implicitly defined by 
professional editors is the first step toward the semi-automated creation of domain-specific topic 
ontologies. Domain-specific topic ontologies may be useful for evaluating the semantic content 
of inferred topics, or for predicting the semantic content of new social media, including Twitter 
messages and Facebook wall-posts. 

Figure 1: Graphical representation of Hierarchical Poisson Convolution (left) and detail on tree 
plate (right) 

2 Hierarchical Poisson Convolution 

The Hierarchical Poisson Convolution model is a data generating process for document collections
whose topics are organized in a hierarchy, and whose topic labels are observed. We refer to the
structure among topics interchangeably as a hierarchy or tree since we assume that each topic
has exactly one parent and that no cyclical parental relations are allowed. Each document
$d \in \{1, \dots, D\}$ is a record of counts $w_{fd}$ for every feature in the vocabulary,
$f \in \{1, \dots, V\}$. The length of the document is given by $L_d$, which we normalize by the
average document length $\bar{L}$ to get $l_d = \frac{1}{\bar{L}} L_d$. Documents have unrestricted
membership to any combination of topics $k \in \{1, \dots, K\}$, represented by a vector of labels
$I_d$ where $I_{dk} = \mathbb{I}\{\text{doc } d \text{ belongs to topic } k\}$.

2.1 Modeling word usage rates on the hierarchy 

The HPC model leverages the known topic hierarchy by assuming that words are used similarly in
neighboring topics. Specifically, the log rate for a word across topics follows a Gaussian diffusion
down the tree. Consider the topic hierarchy presented in the right panel of Figure 1. At the top
level, $\mu_{f,0}$ represents the log rate for feature $f$ overall in the corpus. The log rates
$\mu_{f,1}, \dots, \mu_{f,J}$ for first level topics are then drawn from a Gaussian centered around
the corpus rate with dispersion controlled by the variance parameter $\tau^2_{f,0}$. From first
level topics, we then draw the log rates for the second level topics from another Gaussian centered
around their mean $\mu_{f,j}$ and with variance $\tau^2_{f,j}$. This process is continued down the
tree, with each parent node having a separate variance parameter to control the dispersion of its
children.

Table 1: Generative process for Hierarchical Poisson Convolution

Step: Tree parameters

For feature $f \in \{1, \dots, V\}$:
  • Draw $\mu_{f,0} \sim N(\psi, \gamma^2)$
  • Draw $\tau^2_{f,0} \sim \text{Scaled Inv-}\chi^2(\nu, \sigma^2)$
  • For $j \in \{1, \dots, J\}$ (first level of hierarchy):
      - Draw $\mu_{f,j} \sim N(\mu_{f,0}, \tau^2_{f,0})$
      - Draw $\tau^2_{f,j} \sim \text{Scaled Inv-}\chi^2(\nu, \sigma^2)$
  • For $j \in \{1, \dots, J\}$ (terminal level of hierarchy):
      - Draw $\mu_{f,j1}, \dots, \mu_{f,jJ} \sim N(\mu_{f,j}, \tau^2_{f,j})$
  • Define $\beta_{f,k} \equiv e^{\mu_{f,k}}$ for $k \in \{1, \dots, K\}$

Step: Topic membership parameters

For document $d \in \{1, \dots, D\}$:
  • Draw $\xi_d \sim N(\eta, \Sigma = \lambda^2 I_K)$
  • For topic $k \in \{1, \dots, K\}$:
      - Define $p_{dk} \equiv 1 / (1 + e^{-\xi_{dk}})$
      - Draw $I_{dk} \sim \text{Bernoulli}(p_{dk})$
      - Define $\theta_{dk}(I_d, \xi_d) \equiv e^{\xi_{dk}} I_{dk} / \sum_{j=1}^{K} e^{\xi_{dj}} I_{dj}$

Step: Data generation

For document $d \in \{1, \dots, D\}$:
  • Draw normalized document length $l_d \sim \frac{1}{\bar{L}} \text{Pois}(\upsilon)$
  • For every topic $k$ and feature $f$:
      - Draw count $w_{fdk} \sim \text{Pois}(l_d \theta_{dk} \beta_{fk})$
  • Define $w_{fd} \equiv \sum_{k=1}^{K} w_{fdk}$ (observed data)

The variance parameters $\tau^2_{fp}$ directly control the local differential expression in a
branch of the tree. Words with high variance parameters can have rates in the child topics that
differ greatly from the parent topic $p$, allowing the child rates to diverge. Words with low
variance parameters will have rates close to the parent and so will be expressed similarly among
the children. If we learn a population distribution for the $\tau^2_{fp}$ that has low mean and
variance, it is equivalent to saying that most features are expressed similarly across topics a
priori and that we would need a preponderance of evidence to believe otherwise.
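
As a rough illustration of the diffusion described in this section, the following Python sketch draws the log rates of a single word on a two-level tree. The function names, the hyperparameter values, and the choice of the standard library's `random` module are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math
import random

def draw_scaled_inv_chi2(rng, nu, sigma2):
    """Draw from Scaled Inv-chi^2(nu, sigma2) as nu * sigma2 / chi2_nu."""
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    return nu * sigma2 / chi2

def draw_word_rates(rng, psi, gamma2, nu, sigma2, n_children):
    """Draw log rates for one word on a two-level tree (illustrative only)."""
    mu_root = rng.gauss(psi, math.sqrt(gamma2))         # corpus-level log rate
    tau2_root = draw_scaled_inv_chi2(rng, nu, sigma2)   # dispersion of first-level topics
    tree = {"mu": mu_root, "tau2": tau2_root, "children": []}
    for _ in range(n_children):
        mu_j = rng.gauss(mu_root, math.sqrt(tau2_root))
        tau2_j = draw_scaled_inv_chi2(rng, nu, sigma2)  # dispersion of this node's children
        leaves = [rng.gauss(mu_j, math.sqrt(tau2_j)) for _ in range(n_children)]
        tree["children"].append({"mu": mu_j, "tau2": tau2_j, "leaves": leaves})
    return tree

# Hypothetical hyperparameter values, chosen only to make the example run.
rng = random.Random(0)
tree = draw_word_rates(rng, psi=-5.0, gamma2=1.0, nu=4.0, sigma2=0.25, n_children=3)
```

A small `sigma2` keeps child log rates close to their parent, which is the regularization toward similar expression across topics discussed above.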
2.2 Modeling the topic membership of documents 



Documents in the HPC model can contain content from any of the $K$ topics in the hierarchy at
varying proportions, with the exact allocation given by the vector $\theta_d$ on the $K - 1$
simplex. The model assumes that the count for word $f$ contributed by each topic follows a Poisson
distribution whose rate is moderated by the document's length and membership to the topic; that is,
$w_{fdk} \sim \text{Pois}(l_d \theta_{dk} \beta_{fk})$. The only data we observe is the total word
count $w_{fd} = \sum_{k=1}^{K} w_{fdk}$, but the infinite divisibility property of the Poisson
distribution gives us that $w_{fd} \sim \text{Pois}(l_d \theta_d^T \beta_f)$. These draws are done
for every word in the vocabulary (using the same $\theta_d$) to get the content of the document. 1
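
The convolution can be made concrete with a small Python sketch for one word in one document. The numerical values and the helper names (`poisson_draw`, `topic_rates`) are hypothetical; the sampler is Knuth's simple multiplication method, which is adequate only for small rates.

```python
import math
import random

def poisson_draw(rng, rate):
    """Draw a Poisson variate by Knuth's multiplication method (small rates only)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def topic_rates(l_d, theta_d, beta_f):
    """Per-topic Poisson rates l_d * theta_dk * beta_fk for one word."""
    return [l_d * t * b for t, b in zip(theta_d, beta_f)]

l_d = 1.2                   # normalized document length (assumed value)
theta_d = [0.5, 0.5, 0.0]   # memberships on the simplex; topic 3 is inactive
beta_f = [2.0, 0.4, 3.0]    # topic-specific rates for one word (assumed values)

rates = topic_rates(l_d, theta_d, beta_f)
total_rate = sum(rates)     # equals l_d * theta_d^T beta_f

rng = random.Random(1)
w_fdk = [poisson_draw(rng, r) for r in rates]  # unobserved topic-level counts
w_fd = sum(w_fdk)                              # observed count for the word
```

Only `w_fd` would be observed in practice; its rate is the sum of the per-topic rates, so the topic-level counts never need to be stored.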

In labeled document collections, human coders give us an extra piece of information for each
document, $I_d$, that indicates the set of topics that contributed its content. As a result, we
know $\theta_{dk} = 0$ for all topics $k$ where $I_{dk} = 0$, and only have to determine how
content is allocated between the set of active topics.

The HPC model assumes that these two sources of information for a document are not generated
independently. A document should not have a high probability of being labeled to a topic from
which it receives little content and vice versa. Instead, the model posits a latent
$K$-dimensional topic affinity vector $\xi_d \sim N(\eta, \Sigma)$ that expresses how strongly the
document is associated with each topic. The topic memberships and labels of the document are
different manifestations of this affinity. Specifically, each $\xi_{dk}$ is the log odds that topic
label $k$ is active in the document, with
$I_{dk} \sim \text{Bernoulli}(\text{logit}^{-1}(\xi_{dk}))$. Conditional on the labels, the topic
memberships are the relative sizes of the document's affinity for the active topics and zero for
inactive topics: $\theta_{dk} = e^{\xi_{dk}} I_{dk} / \sum_{j=1}^{K} e^{\xi_{dj}} I_{dj}$.
Restricting each document's membership vectors to the labeled topics is a natural and efficient way
to generate sparsity in the mixing parameters, stabilizing inference and reducing the computational
burden of posterior simulation.
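
The two manifestations of the affinity vector can be sketched in a few lines of Python. The affinity values below are hypothetical and the function names are chosen for this example only.

```python
import math

def label_probabilities(xi_d):
    """p_dk = logit^{-1}(xi_dk): probability that label k is active."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xi_d]

def memberships(xi_d, labels_d):
    """theta_dk = exp(xi_dk) * I_dk / sum_j exp(xi_dj) * I_dj."""
    weights = [math.exp(x) * i for x, i in zip(xi_d, labels_d)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("document has no active topic labels")
    return [w / total for w in weights]

xi_d = [1.0, -0.5, 2.0, 0.0]  # hypothetical affinities for K = 4 topics
labels_d = [1, 0, 1, 0]       # observed labels I_d
p_d = label_probabilities(xi_d)
theta_d = memberships(xi_d, labels_d)
```

Inactive topics receive exactly zero membership, so only the active entries of `theta_d` need to be carried through later computations.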

We outline the generative process in full detail in Table 1, which can be summarized in three 
steps. First, a set of rate and variance parameters is drawn for each feature in the vocabulary. 
Second, a topic affinity vector is drawn for each document in the corpus, which generates the topic 
labels. Finally, both sets of parameters are used to generate the words in each document. For 
simplicity of presentation we assume that each non-terminal node has J children and that the tree 
has only two levels below the corpus level, but the model can accommodate any tree structure. 

2.3 Estimands 

In order to measure topical semantic content, we consider the topic-specific frequency and
exclusivity of each word in the vocabulary. These quantities form a two-dimensional summary of each
word's relation to a topic of interest, with higher scores in both being positively related to
topic-specific content. Additionally, we develop a univariate summary of semantic content that can be 
used to rank words in terms of their semantic content. These estimands are simple functions of 
the rate parameters of HPC; the distribution of the documents' topic memberships is a nuisance 
parameter needed to disambiguate the content of a document between its labeled topics. 

1 This is where the model's name arises: the observed feature count in each document is the
convolution of (unobserved) topic-specific Poisson variates.
A word's topic-specific frequency, $\beta_{fk} = \exp \mu_{fk}$, is directly parameterized in the
model and is regularized across words (via hyperparameters $\psi$ and $\gamma^2$) and across topics.
A word's exclusivity to a topic, $\phi_{f,k}$, is its usage rate relative to a set of comparison
topics $S$: $\phi_{f,k} = \beta_{f,k} / \sum_{j \in S} \beta_{f,j}$. A topic's siblings are a natural
choice for a comparison set to see which words are overexpressed in the topic compared to a set of
similar topics. While not directly modeled in HPC, the exclusivity parameters are also regularized
by the $\tau^2_{fp}$, since if the child rates are forced to be similar then the $\phi_{f,k}$ will
be pushed toward a baseline value of $1/|S|$. We explore the regularization structure of the
model empirically in Section 4.
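
A minimal Python sketch of the exclusivity calculation follows. It assumes that the comparison set contains the topic of interest itself, which is what makes $1/|S|$ the baseline; the topic names and log rates are invented for illustration.

```python
import math

def exclusivity(mu_f, k, comparison_set):
    """phi_{f,k} = beta_{f,k} / sum_{j in S} beta_{f,j}, with beta = exp(mu)."""
    if k not in comparison_set:
        # Assumption of this sketch: S includes topic k along with its siblings.
        raise ValueError("comparison set is assumed to include topic k itself")
    denom = sum(math.exp(mu_f[j]) for j in comparison_set)
    return math.exp(mu_f[k]) / denom

siblings = ["equity", "bond", "money", "commodity"]  # hypothetical sibling topics

# A word used at the same rate by all siblings sits at the baseline 1/|S|.
flat_word = {name: -4.0 for name in siblings}
phi_flat = exclusivity(flat_word, "bond", siblings)

# A word whose log rate is much higher in one topic is nearly exclusive to it.
bond_word = {"equity": -8.0, "bond": -3.0, "money": -8.0, "commodity": -8.0}
phi_bond = exclusivity(bond_word, "bond", siblings)
```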

Since both frequency and exclusivity are important factors in determining a word's semantic 
content, a univariate measure of topical importance is a useful estimand for diverse tasks such as 
dimensionality reduction, feature selection, and content discovery. In constructing a composite 
measure, we do not want a high rank in one dimension to be able to compensate for a low rank in 
the other since frequency or exclusivity alone are not necessarily useful. We therefore adopt the 
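harmonic mean to pull the "average" rank toward the lower score. For word $f$ in topic $k$, we
define the FREX$_{fk}$ score as the harmonic mean of the word's rank in the distribution of
$\phi_{\cdot,k}$ and $\mu_{\cdot,k}$:

$$\text{FREX}_{fk} = \left( \frac{w}{\text{ECDF}_{\phi_{\cdot,k}}(\phi_{f,k})} + \frac{1 - w}{\text{ECDF}_{\mu_{\cdot,k}}(\mu_{f,k})} \right)^{-1}$$

where $w$ is the weight for exclusivity (which we set to 0.5 as a default) and
$\text{ECDF}_{x_{\cdot,k}}$ is the empirical CDF function applied to the values $x$ over the first
index.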
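
The FREX score can be sketched in Python as a weighted harmonic mean of two empirical CDF values. The three-word vocabulary below is a toy example, and the helper names are assumptions of this sketch.

```python
def ecdf(values):
    """Return a function x -> fraction of `values` less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return sum(1 for v in ordered if v <= x) / n

    return cdf

def frex_scores(phi_k, mu_k, w=0.5):
    """Weighted harmonic mean of exclusivity and frequency ranks within one topic."""
    ecdf_phi, ecdf_mu = ecdf(phi_k), ecdf(mu_k)
    return [
        1.0 / (w / ecdf_phi(p) + (1.0 - w) / ecdf_mu(m))
        for p, m in zip(phi_k, mu_k)
    ]

# Toy topic with three words: exclusive but rare, frequent but shared, and balanced.
phi_k = [0.9, 0.2, 0.6]     # exclusivity of each word in topic k
mu_k = [-6.0, -2.0, -3.0]   # log frequency of each word in topic k
scores = frex_scores(phi_k, mu_k)
```

In this toy example the balanced word scores 2/3 while the other two score 1/2, showing how the harmonic mean prevents a high rank in one dimension from compensating for a low rank in the other.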



3 Scalable inference via parallelized HMC sampler 

We use a Gibbs sampler to obtain the posterior expectations of the unknown rate and membership
parameters (and associated hyperparameters) given the observed data. Specifically, inference is
conditioned on $W$, a $D \times V$ matrix of word counts, $I$, a $D \times K$ matrix of topic
labels, $l$, a $D$-vector of document lengths, and $T$, a tree structure for the topics.

Creating a scalable inference method is critical since the space of latent variables grows linearly
in the number of words and documents, with $K(D + V)$ total unknowns. Our model offers an
advantage in that the posterior consists of two groups of parameters, each of whose conditional
posteriors factors given the other. On one side, the conditional posterior of the rate and variance
parameters $\{\mu_f, \tau^2_f\}_{f=1}^{V}$ factors by word given the membership parameters and the
hyperparameters $\psi$, $\gamma^2$, $\nu$ and $\sigma^2$. On the other, the conditional posterior of
the topic affinity parameters $\{\xi_d\}_{d=1}^{D}$ factors by document given the hyperparameters
$\eta$ and $\Sigma$ and the rate parameters $\{\mu_f\}_{f=1}^{V}$.

Conditional on the hyperparameters, therefore, we are left with two blocks of draws that can 
be broken into V or D independent threads. Using parallel computing software such as Message 
Passing Interface (MPI), the computation time for drawing the parameters in each block is only 
constrained by resources required for a single draw. The total runtime need not significantly in- 
crease with the addition of more documents or words as long as the number of available cores also 
increases. 
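
Because each block factors into independent pieces, the per-word (or per-document) updates can be dispatched to a pool of workers. The Python sketch below only illustrates that dispatch pattern: `update_word` is a hypothetical placeholder rather than a real conditional draw, and a thread pool from the standard library stands in for the MPI processes mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

def update_word(args):
    """Placeholder for one word's conditional update (hypothetical stand-in)."""
    f, counts_f = args
    # A real sampler would draw (mu_f, tau2_f) here; this sketch returns a summary.
    return f, sum(counts_f) / len(counts_f)

def parallel_block(word_counts, max_workers=4):
    """Run the independent per-word updates concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(update_word, enumerate(word_counts)))

word_counts = [[1, 1, 1], [2, 3]]  # toy counts for two words
results = parallel_block(word_counts)
```

No worker needs the output of another within a block, so the same pattern applies unchanged to the document block.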
Both of these conditional distributions are only known up to a constant and can be high
dimensional if there are many topics, making direct sampling impossible and random walk Metropolis
inefficient. We are able to obtain uncorrelated draws through the use of Hamiltonian Monte Carlo
(HMC) (Neal, 2011), which leverages the posterior gradient and Hessian to find a distant point
in the parameter space with high probability of acceptance. HMC works well for log densities
that are unimodal and have relatively constant curvature. We give step-by-step instructions for our
implementation of the algorithm in the Appendix.
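
For readers unfamiliar with HMC, the following Python sketch shows a single-parameter version with a leapfrog integrator and a Metropolis correction. It is a simplification under stated assumptions: it uses only the gradient with a unit mass (no Hessian information), targets a standard normal density as a stand-in, and uses step sizes chosen for the example. It is not the implementation described in the Appendix.

```python
import math
import random

def leapfrog(q, p, grad_log_density, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator (unit mass)."""
    p = p + 0.5 * step_size * grad_log_density(q)
    for step in range(n_steps):
        q = q + step_size * p
        if step < n_steps - 1:
            p = p + step_size * grad_log_density(q)
    p = p + 0.5 * step_size * grad_log_density(q)
    return q, p

def hamiltonian(q, p, log_density):
    """Potential energy (negative log density) plus kinetic energy."""
    return -log_density(q) + 0.5 * p * p

def hmc_step(rng, q, log_density, grad_log_density, step_size=0.1, n_steps=20):
    """One HMC transition: resample momentum, integrate, then accept or reject."""
    p0 = rng.gauss(0.0, 1.0)
    q_new, p_new = leapfrog(q, p0, grad_log_density, step_size, n_steps)
    delta = hamiltonian(q, p0, log_density) - hamiltonian(q_new, p_new, log_density)
    if rng.random() < math.exp(min(0.0, delta)):
        return q_new
    return q

# Stand-in target: a standard normal log density, known only up to a constant.
def log_density(q):
    return -0.5 * q * q

def grad_log_density(q):
    return -q

rng = random.Random(0)
q, draws = 0.0, []
for _ in range(2000):
    q = hmc_step(rng, q, log_density, grad_log_density)
    draws.append(q)
```

The long trajectory is what lets each proposal land far from the current point while keeping the energy nearly constant, which is why the acceptance probability stays high.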

After appropriate initialization, we follow a fixed Gibbs scan where the two blocks of latent 
variables are drawn in parallel from their conditional posteriors using HMC. We then draw the 
hyperparameters conditional on all the imputed latent variables. 



3.1 Block Gibbs Sampler

To set up the block Gibbs sampling algorithm, we derive the relevant conditional posterior
distributions and explain how we sample from each.

3.1.1 Updating tree parameters

In the first block, the conditional posterior of the tree parameters factors by word:

$$p(\{\mu_f, \tau^2_f\}_{f=1}^{V} \mid W, I, l, \psi, \gamma^2, \nu, \sigma^2, \{\xi_d\}_{d=1}^{D}, T) \propto \prod_{f=1}^{V} \left\{ \prod_{d=1}^{D} p(w_{fd} \mid I_d, l_d, \mu_f, \xi_d) \right\} \cdot p(\mu_f, \tau^2_f \mid \psi, \gamma^2, \nu, \sigma^2, T)$$

Given the conditional conjugacy of the variance parameters and their strong influence on the
curvature of the rate parameter posterior, we sample the two groups conditional on each other to
optimize HMC performance. Conditioning on the variance parameters, we can write the likelihood of
the rate parameters as a Poisson regression where the documents are observations, the
$\theta_d(I_d, \xi_d)$ are the covariates, and the $l_d$ serve as exposure weights.

The prior distribution of the rate parameters is a Gaussian graphical model, so a priori the log
rates for each word are jointly Gaussian with mean $\psi 1$ and precision matrix
$\Lambda(\gamma^2, \tau^2_f, T)$ which has non-zero entries only for topic pairs that have a direct
parent-child relationship. 2 The log conditional posterior is:

$$\log p(\mu_f \mid W, I, l, \{\tau^2_f\}_{f=1}^{V}, \psi, \gamma^2, \nu, \sigma^2, \{\xi_d\}_{d=1}^{D}, T) = -\sum_{d=1}^{D} l_d \theta_d^T \beta_f + \sum_{d=1}^{D} w_{fd} \log(\theta_d^T \beta_f) - \frac{1}{2} (\mu_f - \psi 1)^T \Lambda (\mu_f - \psi 1)$$

2 In practice this precision matrix can be found easily as the negative Hessian of the log prior distribution.
We use HMC to sample from this unnormalized density. Note that the covariate matrix
$\Theta_{D \times K}$ is very sparse in most cases, so we speed computation with a sparse matrix
representation.
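
The unnormalized log conditional posterior of one word's log rates can be sketched in Python as follows. Two assumptions are made for the example: the Gaussian prior is written in its equivalent parent-child form rather than through an explicit precision matrix, and index 0 of `mu_f` holds the corpus-level log rate, which does not itself generate content. All argument names are hypothetical.

```python
import math

def log_post_mu(mu_f, w_f, l, theta, parent, tau2_f, psi, gamma2):
    """Unnormalized log conditional posterior of one word's log rates.

    mu_f[0] is the corpus-level log rate; mu_f[k], k >= 1, are topic log rates.
    theta[d][k - 1] is document d's membership in topic k.
    parent[k] is the parent index of topic k (0 for first-level topics).
    tau2_f[p] is the variance governing the children of node p.
    """
    beta = [math.exp(m) for m in mu_f[1:]]
    log_lik = 0.0
    for w_fd, l_d, theta_d in zip(w_f, l, theta):
        rate = sum(t * b for t, b in zip(theta_d, beta))  # theta_d^T beta_f
        log_lik += w_fd * math.log(rate) - l_d * rate     # Poisson regression term
    log_prior = -((mu_f[0] - psi) ** 2) / (2.0 * gamma2)
    for k in range(1, len(mu_f)):
        p = parent[k]
        log_prior -= (mu_f[k] - mu_f[p]) ** 2 / (2.0 * tau2_f[p])
    return log_lik + log_prior

# Toy example: two first-level topics, two documents, all log rates at zero.
value = log_post_mu(
    mu_f=[0.0, 0.0, 0.0],
    w_f=[3, 1],
    l=[1.0, 2.0],
    theta=[[1.0, 0.0], [0.5, 0.5]],
    parent={1: 0, 2: 0},
    tau2_f={0: 1.0},
    psi=0.0,
    gamma2=1.0,
)
```

In a sparse implementation the inner sum would run only over each document's active topics, since the remaining entries of `theta` are zero.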

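We know the conditional distribution of the variance parameters due to the conjugacy of the
Inverse-$\chi^2$ prior with the normal distribution of the log rates. Specifically, if $C(T)$ is
the set of child topics of topic $k$ with cardinality $J$, then

$$\tau^2_{fk} \mid \mu_f, \nu, \sigma^2 \sim \text{Inv-}\chi^2\left( J + \nu, \frac{\nu \sigma^2 + \sum_{j \in C} (\mu_{fj} - \mu_{fk})^2}{J + \nu} \right)$$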
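
The conjugate update can be sketched in Python using the standard result for a Scaled Inverse-$\chi^2$ prior with a normal likelihood of known mean. The function names and the numerical values are assumptions of this example.

```python
import random

def tau2_posterior_params(mu_parent, mu_children, nu, sigma2):
    """Degrees of freedom and scale of the conditional Scaled Inv-chi^2."""
    j = len(mu_children)
    sum_sq = sum((m - mu_parent) ** 2 for m in mu_children)
    df = j + nu
    scale = (nu * sigma2 + sum_sq) / df
    return df, scale

def draw_tau2(rng, mu_parent, mu_children, nu, sigma2):
    """Draw tau^2 as df * scale / chi2_df."""
    df, scale = tau2_posterior_params(mu_parent, mu_children, nu, sigma2)
    return df * scale / rng.gammavariate(df / 2.0, 2.0)

# Hypothetical log rates for a parent topic and its three children.
df, scale = tau2_posterior_params(0.0, [1.0, -1.0, 2.0], nu=3.0, sigma2=0.5)
tau2 = draw_tau2(random.Random(2), 0.0, [1.0, -1.0, 2.0], nu=3.0, sigma2=0.5)
```

The posterior scale is a weighted compromise between the prior scale and the observed spread of the child rates around the parent.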

3.1.2 Updating topic affinity parameters 

In the second block, the conditional posterior of the topic affinity vectors factors by document:

$$p(\{\xi_d\}_{d=1}^{D} \mid W, I, l, \{\mu_f\}_{f=1}^{V}, \eta, \Sigma) \propto \prod_{d=1}^{D} \left\{ \prod_{f=1}^{V} p(w_{fd} \mid I_d, l_d, \mu_f, \xi_d) \right\} \cdot p(I_d \mid \xi_d) \cdot p(\xi_d \mid \eta, \Sigma)$$

We can again write the likelihood as a Poisson regression, now with the rates as covariates. The
log conditional posterior for one document is:

$$\log p(\xi_d \mid W, I, l, \{\mu_f\}_{f=1}^{V}, \eta, \Sigma) = -l_d \sum_{f=1}^{V} \beta_f^T \theta_d + \sum_{f=1}^{V} w_{fd} \log(\beta_f^T \theta_d) - \sum_{k=1}^{K} \log(1 + e^{-\xi_{dk}}) - \sum_{k=1}^{K} (1 - I_{dk}) \xi_{dk} - \frac{1}{2\lambda^2} (\xi_d - \eta)^T (\xi_d - \eta)$$

We use HMC to sample from this unnormalized density. Here the parameter vector $\theta_d$ is sparse
rather than the covariate matrix $B_{V \times K}$. If we remove the entries of $\theta_d$ and
columns of $B$ pertaining to topics $k$ where $I_{dk} = 0$, then we are left with a low dimensional
regression where only the active topics are used as covariates, greatly simplifying computation.
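
The document-level log posterior, restricted to the active topics, can be sketched in Python as below. The argument names and the toy inputs are assumptions made for illustration; `beta[f][k]` is taken to hold the rate of word $f$ in topic $k$.

```python
import math

def log_post_xi(xi_d, labels_d, w_d, l_d, beta, eta, lambda2):
    """Unnormalized log conditional posterior of one document's affinity vector."""
    active = [k for k, i in enumerate(labels_d) if i == 1]
    weights = {k: math.exp(xi_d[k]) for k in active}
    total = sum(weights.values())
    theta = {k: v / total for k, v in weights.items()}  # zero elsewhere by construction

    log_lik = 0.0
    for beta_f, w_fd in zip(beta, w_d):
        rate = sum(theta[k] * beta_f[k] for k in active)  # only active topics contribute
        log_lik += w_fd * math.log(rate) - l_d * rate

    # Bernoulli label terms: -log(1 + exp(-xi)) - (1 - I) * xi for every topic.
    log_label = sum(
        -math.log1p(math.exp(-x)) - (1 - i) * x for x, i in zip(xi_d, labels_d)
    )
    log_prior = -sum((x - e) ** 2 for x, e in zip(xi_d, eta)) / (2.0 * lambda2)
    return log_lik + log_label + log_prior

# Toy example: K = 2 topics, one word, only the first topic active.
value = log_post_xi(
    xi_d=[0.0, 0.0],
    labels_d=[1, 0],
    w_d=[1],
    l_d=1.0,
    beta=[[2.0, 5.0]],
    eta=[0.0, 0.0],
    lambda2=1.0,
)
```

The likelihood loop touches only the active topics, while the label and prior terms still involve all $K$ affinities.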



3.1.3 Updating corpus-level parameters 

We draw the hyperparameters after each iteration of the block update. We put flat priors on these 
unknowns so that we can learn their most likely values from the data. As a result, their conditional 
posteriors only depend on the latent variables they generate. 

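The log corpus-level rates $\mu_{f,0}$ for each word follow a Gaussian distribution with mean
$\psi$ and variance $\gamma^2$. The conditional distribution of these hyperparameters is available
in closed form:

$$\psi \mid \gamma^2, \{\mu_{f,0}\}_{f=1}^{V} \sim N\left( \frac{1}{V} \sum_{f=1}^{V} \mu_{f,0}, \frac{\gamma^2}{V} \right)$$

and

$$\gamma^2 \mid \psi, \{\mu_{f,0}\}_{f=1}^{V} \sim \text{Inv-}\chi^2\left( V, \frac{1}{V} \sum_{f=1}^{V} (\mu_{f,0} - \psi)^2 \right).$$

The discrimination parameters $\tau^2_{fk}$ independently follow an identical Scaled
Inverse-$\chi^2$ with convolution parameter $\nu$ and scale parameter $\sigma^2$, while their
inverse follows a Gamma($\kappa_\tau = \frac{\nu}{2}$, $\lambda_\tau = \frac{2}{\nu \sigma^2}$)
distribution. We use HMC to sample from the unnormalized conditional density of these parameters.
Specifically,

$$\log p(\kappa_\tau, \lambda_\tau \mid \{\tau^2_f\}_{f=1}^{V}, T) = (\kappa_\tau - 1) \sum_{f=1}^{V} \sum_{k \in P} \log (\tau^2_{fk})^{-1} - |P| V \kappa_\tau \log \lambda_\tau - |P| V \log \Gamma(\kappa_\tau) - \frac{1}{\lambda_\tau} \sum_{f=1}^{V} \sum_{k \in P} (\tau^2_{fk})^{-1}$$

where $P(T)$ is the set of parent topics on the tree. Each draw of $(\kappa_\tau, \lambda_\tau)$ is
then transformed back to the $(\nu, \sigma^2)$ scale.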
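
The reparameterization and its log density can be sketched in Python. The sketch treats $\lambda_\tau$ as a Gamma scale parameter, consistent with $\lambda_\tau = 2/(\nu\sigma^2)$; the function names are assumptions of this example, and `inv_tau2` is a flat list of the inverse variances over all words and parent topics.

```python
import math

def to_gamma_scale(nu, sigma2):
    """Map (nu, sigma2) to the Gamma shape and scale of the inverse variances."""
    return nu / 2.0, 2.0 / (nu * sigma2)

def to_inv_chi2_scale(kappa, lam):
    """Map a draw of (kappa, lambda) back to (nu, sigma2)."""
    nu = 2.0 * kappa
    return nu, 2.0 / (nu * lam)

def log_post_gamma(kappa, lam, inv_tau2):
    """Log density of (kappa, lambda) given the inverse variances, flat prior."""
    n = len(inv_tau2)  # plays the role of |P| * V
    return (
        (kappa - 1.0) * sum(math.log(x) for x in inv_tau2)
        - n * kappa * math.log(lam)
        - n * math.lgamma(kappa)
        - sum(inv_tau2) / lam
    )

kappa, lam = to_gamma_scale(nu=4.0, sigma2=0.25)
nu_back, sigma2_back = to_inv_chi2_scale(kappa, lam)
```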

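The document-specific topic affinity parameters $\xi_d$ follow a Multivariate Normal distribution
with mean parameter $\eta$ and a covariance matrix parameterized in terms of a scalar,
$\Sigma = \lambda^2 I_K$. The conditional distribution of these hyperparameters is available in
closed form. For efficiency, we choose to put a flat prior on $\log \lambda^2$ rather than the
original scale, which allows us to marginalize out $\eta$ from the conditional posterior of
$\lambda^2$:

$$\lambda^2 \mid \{\xi_d\}_{d=1}^{D} \sim \text{Inv-}\chi^2\left( DK - 1, \frac{\sum_{d=1}^{D} \sum_{k=1}^{K} (\xi_{dk} - \bar{\xi}_k)^2}{DK - 1} \right),$$

and

$$\eta \mid \lambda^2, \{\xi_d\}_{d=1}^{D} \sim N\left( \bar{\xi}, \frac{\lambda^2}{D} I_K \right),$$

where $\bar{\xi}$ is the mean of the affinity vectors across documents.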



3.2 Estimation 

As discussed in Section 2.3, our estimands are the topic-specific frequency and exclusivity of the 
words in the vocabulary, as well as the FREX score that averages each word's performance in 
these dimensions. We use posterior means to estimate frequency and exclusivity, computing these 
quantities at every iteration of the Gibbs sampler and averaging the draws after the burn-in period. 
For the FREX score, we apply the ECDF function to the frequency and exclusivity posterior 
expectations of all words in the vocabulary to estimate the true ECDF. 
3.3 Inference for unlabeled documents

In order to classify unlabeled documents, we need to find the posterior predictive distribution of
the membership vector $I_{\tilde{d}}$ for a new document $\tilde{d}$. Inference is based on the new
document's word counts and the unknown parameters, which we hold constant at their posterior
expectation. Unfortunately, the posterior predictive distribution of the topic affinities
$\xi_{\tilde{d}}$ is intractable without conditioning on the label vector since the labels control
which topics contribute content. We therefore use a simpler model where the topic proportions
depend only on the relative size of the affinity parameters:

$$\theta^*_{\tilde{d}k} = \frac{e^{\xi_{\tilde{d}k}}}{\sum_{j=1}^{K} e^{\xi_{\tilde{d}j}}}, \qquad I_{\tilde{d}k} \sim \text{Bern}\left( \frac{1}{1 + \exp(-\xi_{\tilde{d}k})} \right).$$

The posterior predictive distribution of this simpler model factors into tractable components:

$$p^*(I_{\tilde{d}}, \xi_{\tilde{d}} \mid w_{\tilde{d}}, W, I) \approx p(I_{\tilde{d}} \mid \xi_{\tilde{d}}) \, p^*(\xi_{\tilde{d}} \mid \{\hat{\mu}_f\}_{f=1}^{V}, \hat{\eta}, \hat{\Sigma}, w_{\tilde{d}})$$
$$\propto p(I_{\tilde{d}} \mid \xi_{\tilde{d}}) \, p^*(w_{\tilde{d}} \mid \xi_{\tilde{d}}, \{\hat{\mu}_f\}_{f=1}^{V}) \, p(\xi_{\tilde{d}} \mid \hat{\eta}, \hat{\Sigma}).$$

It is then possible to find the most likely $\xi^*_{\tilde{d}}$ based on the evidence from
$w_{\tilde{d}}$ alone.

4 Results 

We analyze the fit of the HPC model to Reuters Corpus Volume I (RCV1), a large collection of 
news wire stories. First, we demonstrate how the variance parameters $\tau^2_{fp}$ regularize the exclusivity 
with which words are expressed within topics. Second, we show that regularization of exclu- 
sivity has the greatest effect on infrequent words. Third, we explore the joint posterior of the 
topic-specific frequency and exclusivity of words as a summary of topical content, giving special 
attention to the upper right corner of the plot where words score highly in both dimensions. We 
compare words that score highly on the FREX metric to top words scored by frequency alone, the 
current practice in topic modeling. Finally, we compare the classification performance of HPC to 
baseline models. 

4.1 The Reuters Corpus dataset 

RCV1 is an archive of 806,791 newswire stories from a twelve-month period in 1996-1997. 3 As 
described in Lewis et al. (2004), Reuters staffers assigned stories into any subset of 102 hierarchical 
topic categories. In the original data, assignment to any topic required automatic assignment to all 
ancestor nodes, but we removed these redundant ancestor labels since they do not allow our model 
to distinguish intentional assignments to high level categories from assignment to their offspring. 
In our modified annotations, the only documents we see in high level topics are those labeled to 
them and none of their children, which maps onto general content. We preprocessed document 
tokens with the Porter stemming algorithm (getting 300,166 unique stems) and chose the most 
frequent 3% of stems (10,421 unique stems, over 100 million total tokens) for the feature set. 4 

3 Available upon request from the National Institute of Standards and Technology (NIST), 
http://trec.nist.gov/data/reuters/reuters.html 

Figure 2: Topic hierarchy of Reuters corpus 
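
The removal of redundant ancestor labels described in this section can be sketched in Python as follows. The `parent` map is a small hypothetical fragment of the hierarchy written out for this example, not the full Reuters ontology, and the function name is an assumption.

```python
def strip_ancestor_labels(labels, parent):
    """Drop any label that is an ancestor of another label on the same document."""
    ancestors = set()
    for code in labels:
        node = parent.get(code)
        while node is not None:  # walk up to the root, collecting ancestors
            ancestors.add(node)
            node = parent.get(node)
    return {code for code in labels if code not in ancestors}

# Hypothetical fragment of the topic tree: child code -> parent code.
parent = {"C15": "CCAT", "C151": "C15", "C152": "C15", "E12": "ECAT"}

kept = strip_ancestor_labels({"CCAT", "C15", "C151", "ECAT"}, parent)
only_root = strip_ancestor_labels({"CCAT"}, parent)
```

A document labeled only to a high-level category keeps that label, which matches the interpretation of such documents as carrying general content.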

The Reuters topic hierarchy has three levels that divide the content into finer categories at each 
cut. At the first level, content is divided between four high level categories: three that focus 
on business and market news (Markets, Corporate/Industrial, and Economics) and one grab bag 
category that collects all remaining topics from politics to entertainment (Government/Social). The 
second level provides fine-grained divisions of these broad categories and contains the terminal 
nodes for most branches of the tree. For example, the Markets topic is split between equity, bond, 
money, and commodity markets at the second level. The third level offers further subcategories 
where needed for a small set of second level topics. For example, the Commodity Markets topic 
is divided between agricultural (soft), metal, and energy commodities. We present a graphical 
illustration of the Reuters topic hierarchy in Figure 2. 

Many documents in the Reuters corpus are labeled to multiple topics, even after redundant an- 
cestor memberships are removed. Overall, 32% of the documents are labeled to more than one 
node of the topic hierarchy. Fifteen percent of documents have very diverse content, being labeled 
to two or more of the main branches of the tree (Markets, Commerce, Economics, and Govern- 
ment/Social). Twenty-one percent of documents are labeled to multiple second-level categories on 
the same branch (for example, bond markets and equity markets in the Markets branch). Finally, 
14% of documents are labeled to multiple children of the same second-level topic (for example, 
metals trading and energy markets in the commodity markets branch of Markets). Therefore, a 
completely general mixed membership model such as HPC is necessary to capture the labeling 

4 Including rarer features did not meaningfully change the results. 
Table 2: Topic membership statistics

Topic code Topic name              # docs    Any MM  CB L1 MM  CB L2 MM  CB L3 MM
CCAT       CORPORATE/INDUSTRIAL      2170    79.60%    79.60%    13.10%     0.80%
C11        STRATEGY/PLANS           24325     51.50     11.50     44.50      4.50
C12        LEGAL/JUDICIAL           11944     99.20     98.90     50.20      1.70
C13        REGULATION/POLICY        37410     85.90     55.60     61.40      4.50
C14        SHARE LISTINGS            7410     30.30      7.90     10.30     15.80
C15        PERFORMANCE                229     82.10     35.80     74.20      1.70
C151       ACCOUNTS/EARNINGS        81891      7.90      1.30      0.60      6.40
C152       COMMENT/FORECASTS        73092     18.90      4.80      1.60     13.50
C16        INSOLVENCY/LIQUIDITY      1920     66.70     31.50     54.60      3.60
C17        FUNDING/CAPITAL           4767     78.10     41.40     67.70      5.00
C171       SHARE CAPITAL            18313     44.60      3.20      1.70     41.50
C172       BONDS/DEBT ISSUES        11487     15.10      5.70      0.30      9.70
C173       LOANS/CREDITS             2636     24.70      8.50      3.60     15.60
C174       CREDIT RATINGS            5871     65.60     59.00      0.50      7.50
C18        OWNERSHIP CHANGES           30     76.70     23.30     76.70      3.30
C181       MERGERS/ACQUISITIONS     43374     34.40      6.50      4.80     26.90
C182       ASSET TRANSFERS           4671     28.30      4.70      5.70     21.00
C183       PRIVATISATIONS            7406     73.70     34.20      6.30     44.10
C21        PRODUCTION/SERVICES      25403     76.40     46.50     53.60      0.80
C22        NEW PRODUCTS/SERVICES     6119     55.00     15.30     49.10      0.40
C23        RESEARCH/DEVELOPMENT      2625     77.00     36.40     57.80      0.90
C24        CAPACITY/FACILITIES      32153     72.20     33.60     58.40      0.90
C31        MARKETS/MARKETING        29073     46.90     25.30     34.60      1.30
C311       DOMESTIC MARKETS          4299     80.60     73.70      9.50     18.70
C312       EXTERNAL MARKETS          6648     78.10     70.40      9.60     14.20
C313       MARKET SHARE              1115     39.70     10.30      5.10     27.80
C32        ADVERTISING/PROMOTION     2084     63.80     26.90     52.50      1.40
C33        CONTRACTS/ORDERS         14122     48.00     12.60     40.50      0.80
C331       DEFENCE CONTRACTS         1210     68.00     65.50     13.30      3.40
C34        MONOPOLIES/COMPETITION    4835     92.30     54.90     75.70     14.00
C41        MANAGEMENT                1083     75.60     52.10     59.90      2.00
C411       MANAGEMENT MOVES         10272     17.70      9.60      2.40      8.20
C42        LABOUR                   11878     99.70     99.60     46.50      1.50
ECAT       ECONOMICS                  621     90.50     90.50      9.70      1.40
E11        ECONOMIC PERFORMANCE      8568     43.00     24.20     29.10      5.10
E12        MONETARY/ECONOMIC        24918     81.70     75.40     17.90     13.70
E121       MONEY SUPPLY              2182     30.50     23.10      0.70      9.20
E13        INFLATION/PRICES           130     60.00     46.90     28.50      0.80
E131       CONSUMER PRICES           5659     24.70     15.60      6.00     12.00
E132       WHOLESALE PRICES           939     19.00      3.40      0.60     16.90
E14        CONSUMER FINANCE           428     73.80     43.20     61.00      1.60
E141       PERSONAL INCOME            376     75.00     63.80      9.60     22.30


E142 


CONSUMER CREDIT 


200 


46.00 


30.00 


3.50 


18.50 


E143 


RETAIL SALES 


1206 


27.50 


19.70 


2.40 


10.20 


E21 


GOVERNMENT FINANCE 


941 


86.70 


81.40 


53.90 


4.00 


E211 


EXPENDITURE/REVENUE 


15768 


78.20 


72.40 


16.10 


13.80 


E212 


GOVERNMENT BORROWING 


27405 


32.70 


29.60 


2.70 


4.50 


E31 


OUTPUT/CAPACITY 


591 


45.20 


18.30 


35.20 


0.50 


E311 


INDUSTRIAL PRODUCTION 


1701 


17.70 


9.80 


3.10 


9.30 


E312 


CAPACITY UTILIZATION 


52 


65.40 


13.50 


3.80 


57.70 


E313 


INVENTORIES 


111 


26.10 


10.80 


0.00 


16.20 


E41 


EMPLOYMENT/LABOUR 


14899 


100.00 


100.00 


49.40 


2.20 


E411 


UNEMPLOYMENT 


2136 


92.00 


90.60 


10.40 


12.00 


E51 


TRADE/RESERVES 


4015 


85.10 


75.50 


38.70 


1.90 


E511 


BALANCE OF PAYMENTS 


2933 


63.80 


43.70 


8.20 


25.70 


E512 


MERCHANDISE TRADE 


12634 


64.90 


59.10 


11.50 


11.70 


E513 


RESERVES 


2290 


30.10 


22.70 


1.30 


16.80 


E61 


HOUSING STARTS 


391 


51.70 


47.80 


13.80 


0.80 


E71 


LEADING INDICATORS 


5270 


2.90 


0.60 


2.40 


0.20 



Key: MM = Mixed membership, CB Lx = Cross-branch MM at level x 
Table 3: Topic membership statistics, continued

Code  Topic name                          # docs  Any MM  CB L1 MM  CB L2 MM  CB L3 MM
GCAT  GOVERNMENT/SOCIAL                    24546    2.50      2.50      0.50      0.10
G15   EUROPEAN COMMUNITY                    1545   16.10      6.90     14.60      0.00
G151  EC INTERNAL MARKET                    3307   98.00     87.20     10.60     94.30
G152  EC CORPORATE POLICY                   2107   96.70     90.70     40.30     50.30
G153  EC AGRICULTURE POLICY                 2360   96.10     94.20     31.40     27.70
G154  EC MONETARY/ECONOMIC                  8404   98.20     93.00     11.50     43.90
G155  EC INSTITUTIONS                       2124   70.80     42.00     24.30     54.00
G156  EC ENVIRONMENT ISSUES                  260   75.00     57.70     28.80     50.80
G157  EC COMPETITION/SUBSIDY                2036  100.00     99.80     60.20     32.50
G158  EC EXTERNAL RELATIONS                 4300   80.70     62.80     27.00     24.80
G159  EC GENERAL                              40   47.50     17.50     35.00      2.50
GCRIM CRIME, LAW ENFORCEMENT               32219   79.50     41.60     59.40      0.90
GDEF  DEFENCE                               8842   93.70     17.20     84.40      0.50
GDIP  INTERNATIONAL RELATIONS              37739   73.70     20.50     60.70      0.90
GDIS  DISASTERS AND ACCIDENTS               8657   75.70     40.10     52.20      0.20
GENT  ARTS, CULTURE, ENTERTAINMENT          3801   68.80     29.20     49.60      0.50
GENV  ENVIRONMENT AND NATURAL WORLD         6261   90.20     51.50     72.30      2.50
GFAS  FASHION                                313   76.40     45.70     41.50      1.90
GHEA  HEALTH                                6030   81.90     56.10     65.00      1.20
GJOB  LABOUR ISSUES                        17241   99.60     99.40     44.60      3.30
GMIL  MILLENNIUM ISSUES                        5  100.00    100.00     40.00      0.00
GOBIT OBITUARIES                             844   99.40     15.30     99.40      0.00
GODD  HUMAN INTEREST                        2802   60.70      9.70     55.20      0.10
GPOL  DOMESTIC POLITICS                    56878   79.60     29.70     63.00      1.80
GPRO  BIOGRAPHIES, PERSONALITIES, PEOPLE    5498   87.50     10.00     84.70      0.10
GREL  RELIGION                              2849   86.10      6.60     84.30      0.10
GSCI  SCIENCE AND TECHNOLOGY                2410   55.20     22.20     45.10      0.30
GSPO  SPORTS                               35317    1.30      0.60      0.90      0.00
GTOUR TRAVEL AND TOURISM                     680   89.60     69.70     34.70      3.40
GVIO  WAR, CIVIL WAR                       32615   67.30     10.10     64.60      0.10
GVOTE ELECTIONS                            11532  100.00     13.30    100.00      1.30
GWEA  WEATHER                               3878   73.90     46.80     46.40      0.10
GWELF WELFARE, SOCIAL SERVICES              1869   95.40     75.50     74.10      3.40
MCAT  MARKETS                                894   81.10     81.10     14.50      2.20
M11   EQUITY MARKETS                       48700   16.30     12.30      3.90      2.90
M12   BOND MARKETS                         26036   21.30     15.60      5.20      3.50
M13   MONEY MARKETS                          447   65.80     51.90     23.30      1.60
M131  INTERBANK MARKETS                    28185   15.10      9.40      0.70      6.40
M132  FOREX MARKETS                        26752   36.90     24.70      3.10     16.10
M14   COMMODITY MARKETS                     4732   18.00     16.70      2.30      0.10
M141  SOFT COMMODITIES                     47708   24.10     22.80      5.50      2.00
M142  METALS TRADING                       12136   34.70     19.30      4.10     16.10
M143  ENERGY MARKETS                       21957   21.10     18.40      4.80      2.90

Key: MM = Mixed membership, CB Lx = Cross-branch MM at level x. MM columns are percentages.



patterns of the corpus. A full breakdown of membership statistics by topic is presented in Tables 2 
and 3. 



4.2 How the differential usage parameters regulate topic exclusivity 

A word can only be exclusive to a topic if its expression across the sibling topics is allowed to
diverge from the parent rate. Therefore, we would only expect words with high differential usage
parameters $\tau^2_{fp}$ at the parent level to be candidates for highly exclusive expression $\phi_{fk}$ in any
child topic $k$. Words with child topic rates that cannot vary greatly from the parent should have
nearly equal expression in each child $k$, meaning $\phi_{fk} \approx 1/C$ for a branch with $C$ child topics. An
important consequence is that, although the $\phi_{fk}$ are not directly modeled in HPC, their distribution
is regularized by learning a prior distribution on the $\tau^2_{fp}$.

Figure 3: Exclusivity as a function of differential usage parameters. Left panel: differential-exclusivity
plot for MARKETS; right panel: differential-exclusivity plot for PERFORMANCE. Horizontal axis:
differential usage from parent, $\log(\tau^2_{fp})$.

This tight relation can be seen in the HPC fit. Figure 3 shows the joint posterior expectation 
of the differential usage parameters in a parent topic and exclusivity parameters across the child 
topics. Specifically, the left panel compares the rate variance of the children of Markets from 
their parent to exclusivity between the child topics; the right panel does the same with the two 
children of Performance, a second-level topic under the Corporate category. The plots have similar 
patterns. For low levels of differential expression, the exclusivity parameters are clustered around
the baseline value, $1/C$. At high levels of child rate variance, words gain the ability to approach
exclusive expression in a single topic.
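
As a rough numerical illustration of this relationship (a sketch, not the HPC fit itself), the snippet below assumes that exclusivity is a word's share of the total rate across the C sibling topics, phi_fk = exp(mu_fk) / sum_j exp(mu_fj), and that child log rates are drawn from a normal distribution centered on the parent log rate with variance tau^2_fp. The function names, the number of siblings, and all numerical values are hypothetical.

```python
import math
import random

def exclusivity(child_log_rates):
    """Share of a word's total rate claimed by each sibling topic (assumed definition)."""
    m = max(child_log_rates)                       # subtract the max for numerical stability
    rates = [math.exp(mu - m) for mu in child_log_rates]
    total = sum(rates)
    return [r / total for r in rates]

def max_exclusivity(parent_log_rate, tau2, n_children, rng):
    """Draw child log rates around the parent and return the largest exclusivity."""
    sd = math.sqrt(tau2)
    mus = [rng.gauss(parent_log_rate, sd) for _ in range(n_children)]
    return max(exclusivity(mus))

rng = random.Random(0)
C = 4                                              # hypothetical number of sibling topics
low = [max_exclusivity(-6.0, 1e-4, C, rng) for _ in range(1000)]
high = [max_exclusivity(-6.0, 9.0, C, rng) for _ in range(1000)]
mean_low = sum(low) / len(low)                     # stays close to the 1/C baseline
mean_high = sum(high) / len(high)                  # well above 1/C: some words near-exclusive
```

Under these assumptions, a small differential usage variance pins every sibling near the 1/C baseline, while a large variance lets a single child claim most of a word's occurrences.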



4.3 How frequency modulates regularization of exclusivity 

One of the most appealing aspects of regularization in generative models is that it acts most
strongly on the parameters for which we have the least information. In the case of the exclusivity
parameters in HPC, we have the most data for frequent words, so for a given topic the words
with low rates should be least able to escape regularization of their exclusivity parameters by our
shrinkage prior on the parent's $\tau^2_{fp}$.

Figure 4 shows for two topics the joint posterior expectation of each word's frequency in that
topic and its exclusivity compared to sibling topics (the FREX plot). The left panel features the
Science and Technology topic, a child in the grab bag Government/Social branch, and the right
panel features the Research/Development topic, a child in the Corporate branch. The overall shape
of the joint posterior is very similar for both topics. On the left side of the plots, the exclusivity
of rare words is unable to significantly exceed the $1/C$ baseline. This is because the model does not
have much evidence to estimate usage in the topic, so the estimated rate is shrunk heavily toward
the parent rate. However, we see that it is possible for rare words to be underexpressed in a topic,
which happens if they are frequent and overexpressed in a sibling topic. Even though their rates
are similar to the parent in this topic, sibling topics may have a much higher rate and account for
most appearances of the word in the comparison group.

Figure 4: Frequency-Exclusivity (FREX) plots. Left panel: SCIENCE AND TECHNOLOGY; right panel:
RESEARCH/DEVELOPMENT. Horizontal axis: frequency, $\mu_{fk}$.

4.4 Frequency and Exclusivity as a two-dimensional summary of semantic content

Words in the upper right of the FREX plot (those that are both frequent and highly exclusive)
are of greatest interest. These are the most common words in the corpus that are also likely to
have been generated from the topic of interest (rather than similar topics). We show words in the
upper 5% quantiles in both dimensions for our example topics in Figure 5. These high-scoring
words can help to clarify content even for labeled topics. In the Science and Technology topic, we
see almost all terms are specific to the American and Russian space programs. Similarly, in the
Research/Development topic, almost all terms relate to clinical trials in medicine or to agricultural
research.

We also compute the Frequency-Exclusivity (FREX) score for each word-topic pair, a univariate 
summary of topical content that averages performance in both dimensions. In Table 4 we compare 
the top FREX words in three topics to a ranking based on frequency alone, which is the current 
practice in topic modeling. For context, we also show the immediate neighbors of each topic in 
the tree. The topic being examined is in bolded red, while the borders of the comparison set are 
solid. The Defense Contracts topic is a special case since it is an only child. In these cases, we use 
a comparison to the parent topic to calculate exclusivity. 
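
One way to form such a score is sketched below, assuming the weighted harmonic mean of empirical CDF ranks described in Bischof and Airoldi (2012). The weight w = 0.5, the function names, and the toy frequency and exclusivity values are hypothetical and serve only to show the mechanics.

```python
def ecdf_rank(values):
    """Empirical CDF of each value within its own list: fraction of entries <= value."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]

def frex(log_freq, excl, w=0.5):
    """Weighted harmonic mean of exclusivity and frequency ECDF ranks within one topic."""
    f_rank = ecdf_rank(log_freq)
    e_rank = ecdf_rank(excl)
    return [1.0 / (w / e + (1.0 - w) / f) for f, e in zip(f_rank, e_rank)]

words = ["said", "nuclear", "eleph", "greenpeac"]
log_freq = [-2.0, -4.0, -9.0, -4.5]   # hypothetical log rates in one topic
excl = [0.05, 0.60, 0.90, 0.95]       # hypothetical exclusivity among sibling topics
scores = frex(log_freq, excl)
ranked = [word for _, word in sorted(zip(scores, words), reverse=True)]
```

Because the harmonic mean is dominated by the smaller of the two ranks, a word must do reasonably well on both dimensions to score highly: in this toy input the very frequent but non-exclusive "said" and the exclusive but rare "eleph" both fall below "greenpeac" and "nuclear".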

By incorporating exclusivity information, FREX-ranked lists include fewer words that are used 
similarly everywhere (such as said and would) and fewer words that are used similarly in a set of 
related topics (such as price and market in the Markets branch). One can understand this result by 
comparing the rankings for known stop words from the SMART list to other words. In Figure 6, we 
show the maximum ECDF ranking for each word across topics in the distribution of frequency (left 
panel) and exclusivity (right panel) estimates. One can see that while stop words are more likely 
to be in the extreme quantiles of frequency, very few of them are among the most exclusive words. 
This prevents general and context-specific stop words from ranking highly in a FREX-based index.

4.5 Classification performance 

We compare the classification performance of HPC with SVM and L2-regularized logistic regression
(Genkin et al., 2007; Rubin et al., 2012; Ghamrawi and McCallum, 2005). All methods were
trained on a random sample of 15% of the documents using the 3% most frequent words in the
corpus as features. These fits were used to predict memberships in the withheld documents, an
experiment we repeated ten times with a new random sample as a training set. Table 5 shows
the results of our experiment, using both micro averages (every document weighted equally) and
macro averages (every topic weighted equally). While HPC does not dominate other methods, on
average its performance does not deviate significantly from traditional classification algorithms.

Figure 5: Upper right corner of FREX plot. Panel shown: upper 5% of FREX plot for SCIENCE AND
TECHNOLOGY.

Table 4: Comparison of High FREX words (both frequent and exclusive) to most frequent words
(featured topic name bold red; comparison set in solid ovals)

Topic in the Markets branch (remaining entries not recoverable):
  High FREX:     copper
  Most frequent: price

Topic in the Gov't/Social branch:
  High FREX:     greenpeac, environment, pollut, wast, emiss, reactor, forest, speci, environ,
                 eleph, spill, wildlif, energi, nuclear
  Most frequent: said, would, environment, year, state, nuclear, million, greenpeac, world,
                 water, group, govern, nation, environ

Figure 6: Comparison of FREX score components for SMART stop words vs. regular words. Left panel:
density of maximum frequency ECDF over all topics; right panel: density of maximum exclusivity
ECDF over all topics.

HPC is not designed to optimize out-of-sample predictive accuracy; rather, it is designed to
maximize the interpretability of the label-specific summaries, in terms of words that are both frequent
and exclusive. These results offer a quantitative illustration of the classical trade-off between
predictive and explanatory power of statistical models (Breiman, 2001).
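
For readers unfamiliar with the two averaging schemes, the sketch below computes micro- and macro-averaged precision and recall for multi-label predictions. It assumes the standard definitions, in which micro averages pool all document-label decisions and macro averages take an unweighted mean of per-topic scores; the function name and the toy labels are hypothetical and are not taken from the Reuters experiment.

```python
def micro_macro(true_labels, pred_labels, topics):
    """Micro- and macro-averaged precision and recall for multi-label predictions.

    true_labels, pred_labels: lists of sets of topic codes, one set per document.
    """
    tp = {k: 0 for k in topics}
    fp = {k: 0 for k in topics}
    fn = {k: 0 for k in topics}
    for truth, pred in zip(true_labels, pred_labels):
        for k in topics:
            if k in pred and k in truth:
                tp[k] += 1
            elif k in pred:
                fp[k] += 1
            elif k in truth:
                fn[k] += 1

    def ratio(num, den):
        return num / den if den else 0.0

    # Micro: pool the counts over all topics before dividing.
    micro_p = ratio(sum(tp.values()), sum(tp.values()) + sum(fp.values()))
    micro_r = ratio(sum(tp.values()), sum(tp.values()) + sum(fn.values()))
    # Macro: compute per-topic scores, then take an unweighted mean over topics.
    macro_p = sum(ratio(tp[k], tp[k] + fp[k]) for k in topics) / len(topics)
    macro_r = sum(ratio(tp[k], tp[k] + fn[k]) for k in topics) / len(topics)
    return micro_p, micro_r, macro_p, macro_r

truth = [{"M11"}, {"M11"}, {"M11", "E12"}, {"E12"}]
pred = [{"M11"}, {"M11"}, {"M11"}, {"M11"}]
micro_p, micro_r, macro_p, macro_r = micro_macro(truth, pred, ["M11", "E12"])
```

In this toy input a classifier that always predicts the common topic scores well on the micro averages but poorly on the macro averages, which is why reporting both is informative when topic sizes are very unequal.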



Table 5: Classification performance for ten-fold cross-validation

                     SVM            L2-reg Logit   HPC
Micro-ave Precision  0.711 (0.002)  0.195 (0.031)  0.695 (0.007)
Micro-ave Recall     0.706 (0.001)  0.768 (0.013)  0.589 (0.008)
Macro-ave Precision  0.563 (0.002)  0.481 (0.025)  0.505 (0.094)
Macro-ave Recall     0.551 (0.006)  0.600 (0.007)  0.524 (0.093)

Standard deviation of performance over ten folds in parentheses.






5 Discussion 



Our thesis is that one needs to know how words are used differentially across topics as well as 
within them in order to understand topical content; we refer to these dimensions of content as 
word exclusivity and frequency. Topical summaries that focus on word frequency alone are often 
dominated by stop words or other terms used similarly across many topics. Exclusivity and fre- 
quency can be visualized graphically as a latent space or combined into an index such as the FREX 
score to obtain a univariate measure of the topical content for words in each topic. 

Naive estimates of exclusivity will be biased toward rare words due to sensitivity to small 
differences in estimated use across topics. Existing topic models such as LDA cannot regularize 
differential use due to topic normalization of usage rates; its symmetric Dirichlet prior on topic 
distributions regularizes within, not between, topic usage. While topic-regularized models can 
capture many important facets of word usage, they are not optimal for the estimands used in our 
analysis of topical content. 

HPC breaks from standard topic models by modeling topic-specific word counts as unnormalized
count variates whose rates can be regularized both within and across topics to compute word
frequency and exclusivity. It was specifically designed to produce stable exclusivity estimates in
human-annotated corpora by smoothing differential word usage according to a semantically intelligent
distance metric: proximity on a known hierarchy. This supervised setting is an ideal test case
for our framework and will be applicable to many high value corpora such as the ACM library, IMS
publications, the New York Times and Reuters, which all have professional editors and authors and
provide multiple annotations to a hierarchy of labels for each document.

HPC offers a complex challenge for full Bayesian inference. To offer a flexible framework for 
regularization, it breaks from the simple Dirichlet-Multinomial conjugacy of traditional models. 
Specifically, HPC uses Poisson likelihoods whose rates are smoothed across a known topic hier- 
archy with a Gaussian diffusion and a novel mixed membership model where document label and 
topic membership parameters share a Gaussian prior. The membership model is the first to create 
an explicit link between the distribution of topic labels in a document and of the words that appear 
in a document and allow for multiple labels. However, the resulting inference is challenging since, 
conditional on word usage rates, the posterior of the membership parameters involves Poisson and 
Bernoulli likelihoods of differing dimensions constrained by a Gaussian prior. 

We offer two methodological innovations to make inference tractable. First, we design our
model with parameters that divide cleanly into two blocks (the tree and document parameters)
whose members are conditionally independent given the other block, allowing for parallelized,
scalable inference. However, these factorized distributions cannot be normalized analytically and
have the same dimension as the number of topics (102 in the case of Reuters). Second, we therefore
implement a Hamiltonian Monte Carlo conditional sampler that mixes efficiently through high-dimensional
spaces by leveraging the posterior gradient and Hessian information. This allows HPC to
scale to large and complex topic hierarchies that would be intractable for Random Walk Metropolis
samplers.

One unresolved bottleneck in our inference strategy is that the MCMC sampler mixes slowly
through the hyperparameter space of the documents: the $\eta$ and $\lambda^2$ parameters that control the
mean and sparsity of topic memberships and labels. This is due to a large fraction of missing information
in our augmentation strategy (Meng and Rubin, 1991). Conditional on all the documents'
topic affinity parameters $\{\xi_d\}_{d=1}^{D}$, these hyperparameters index a normal distribution with $D$ observations;
marginally, however, we have much less information about the exact loading of each topic
onto each document. While we have been exploring more efficient data augmentation strategies
such as Parameter Expansion (Liu and Wu, 1999), we have not found a workable alternative to
augmenting the posterior with the entire set of $\{\xi_d\}_{d=1}^{D}$ parameters.

5.1 Concluding remarks 

While HPC was developed for the specific case of hierarchically labeled document collections, 
this framework can be readily extended to other types of document corpora. For labeled corpora 
where no hierarchical structure on the topics is available, one can use a flat hierarchy to model 
differential use. For document corpora where no labeled examples are available, a simple word 
rate model with a flat hierarchy and dense topic membership structure could be employed to get 
more informative summaries of inferred topics. In either case, the word rate framework could 
be combined with nonparametric Bayesian models that infer hierarchical structure on the topics
(Adams et al., 2010). We expect modeling approaches based on rates will play an important role 
in future work on text summarization. 

The HPC model can also be leveraged to semi-automate the construction of topic ontologies 
targeted to specific domains, for instance, when fit to comprehensive human-annotated corpora
such as Wikipedia, The New York Times, Encyclopedia Britannica, or databases such as JSTOR and 
the ACM repository. By learning a probabilistic representation of high quality topics, HPC output 
can be used as a gold standard to aid and evaluate other learning methods. Targeted ontologies have 
been a key factor in monitoring scientific progress in biology (Ashburner et al., 2000; Kanehisa 
and Goto, 2000). A hierarchical ontology of topics would lead to new metrics for measuring 
progress in text analysis. It would enable an evaluation of the semantic content of any collection 
of inferred topics, thus finally allowing for a quantitative comparison among the output of topic 
models. Current evaluations are qualitative, anecdotal and unsatisfactory; for instance, authors 
argue that lists of most frequent words describing an arbitrary selection of topics inferred by a new 
model make sense intuitively, or that they are better than lists obtained with other models.

In addition to model evaluation, a news-specific ontology could be used as a prior to inform
the analysis of unstructured text, including Twitter feeds, Facebook wall posts, and blogs. Unsu- 
pervised topic models infer a latent topic space that may be oriented around unhelpful axes, such 
as authorship or geography. Using a human-created ontology as a prior could ensure that a useful 
topic space is discovered without being so dogmatic as to assume that unlabeled documents have 
the same latent structure as labeled examples. 






References 



R. P. Adams, Z. Ghahramani, and M. I. Jordan. Tree-structured stick breaking for hierarchical 
data. In J. Shawe-Taylor, R. Zemel, J. Lafferty, and C. Williams, editors, Advances in Neural 
Information Processing (NIPS) 23, 2010. 

E. M. Airoldi, W. W. Cohen, and S. E. Fienberg. Bayesian methods for frequent terms in text:
Models of contagion and the delta-square statistic. CSNA and INTERFACE Annual Meetings, 
2005. 

E. M. Airoldi, A. G. Anderson, S. E. Fienberg, and K. K. Skinner. Who wrote Ronald Reagan's 
radio addresses? Bayesian Analysis, 1(2):289-320, 2006.

E. M. Airoldi, S. E. Fienberg, and K. K. Skinner. Whose ideas? Whose words? Authorship of the 
Ronald Reagan radio addresses. Political Science & Politics, 40:501-506, 2007. 

E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed-membership stochastic blockmodels.
Journal of Machine Learning Research, 9:1981-2014, 2008.

E. M. Airoldi, E. A. Erosheva, S. E. Fienberg, C. J. Joutard, T. M. Love, and S. Shringarpure. 
Reconceptualizing the classification of PNAS articles. PNAS, 107, 2010.

M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski,
S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C.
Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology:
Tool for the unification of biology. The gene ontology consortium. Nature Genetics, 25(1):
25-29, 2000.

Jonathan Bischof and Edoardo Airoldi. Summarizing topical content with word frequency and 
exclusivity. ICML, 2012. 

D. Blei. Introduction to probabilistic topic models. Communications of the ACM, 2012. In press. 

D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested 
Chinese restaurant process. NIPS, 2003a. 

David Blei and John McAuliffe. Supervised topic models, volume 21. Neural Information Pro- 
cessing Systems, 2007. 

David Blei, Andrew Ng, and Michael Jordan. Latent dirichlet allocation. Journal of Machine 
Learning Research, 2003b. 

L. Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199-231, 2001.

John Canny. GAP: A Factor Model for Discrete Data. SIGIR, 2004. 

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David Blei. Reading tea 
leaves: How humans interpret topic models. Neural Information Processing Systems, 2009. 






Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. Sparse Additive Generative Models of Text. 
ICML, 2011. 

Alexander Genkin, David D. Lewis, and David Madigan. Large-scale Bayesian logistic regression
for text categorization. Technometrics, 49, 2007. 

Nadia Ghamrawi and Andrew McCallum. Collective multi-label classification. Fourteenth Con- 
ference on Information and Knowledge Management (CIKM), 2005. 

Justin Grimmer and Gary King. General purpose computer-assisted clustering and conceptualiza- 
tion. PNAS, 2011. 

H. Hotelling. Relations between two sets of variates. Biometrika, 28:321-377, 1936.

Yuening Hu, Jordan Boyd-Graber, and Brianna Satinoff. Interactive Topic Modeling. Association 
for Computational Linguistics, 2011. 

I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

M. Kanehisa and S. Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids 
Research, 28(1):27-30, 2000.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A New Benchmark Collection 
for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. 

Jun S. Liu and Ying Nian Wu. Parameter expansion for data augmentation. Journal of the American
Statistical Association, 94:1264-1274, 1999. 

Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, and Andrew Ng. Improving text classifi- 
cation by shrinkage in a hierarchy of classes. International Conference on Machine Learning, 
1998. 

Geoffrey McLachlan and David Peel. Finite Mixture Models. Wiley, 2000. 

Xiao-Li Meng and Donald Rubin. Using EM to obtain asymptotic variance-covariance matrices:
The SEM algorithm. Journal of the American Statistical Association, 86:899-909, 1991.

David Mimno, Wei Li, and Andrew McCallum. Mixtures of hierarchical topics with pachinko 
allocation. ICML, 2007. 

David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Opti- 
mizing Semantic Coherence in Topic Models. EMNLP, 2011. 

Burt Monroe, Michael Colaresi, and Kevin Quinn. Fightin' words: Lexical feature selection and 
evaluation for identifying the content of political conflict. Political Analysis, 16:372-403, 2008. 

F. Mosteller and D.L. Wallace. Applied Bayesian and Classical Inference: The Case of "The 
Federalist" Papers. Springer-Verlag, 1984.






Radford Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin L.
Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC
Press, 2011.

Adler Perotte, Nicholas Bartlett, Noemie Elhadad, and Frank Wood. Hierarchically Supervised 
Latent Dirichlet Allocation. NIPS, 2012. 

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A 
supervised topic model for credit attribution in multi-labeled corpora. EMNLP, 2009. 

T. Rubin, A. Chambers, P. Smyth, and M. Steyvers. Statistical topic models for multi-label docu- 
ment classification. Machine Learning, 88, 2012. 

Kyung-Ah Sohn and Eric P. Xing. A hierarchical dirichlet process mixture model for haplotype 
reconstruction from multi-population data. Annals of Applied Statistics, 3:791-821, 2009. 

Hanna Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why Priors Matter. 
NIPS, 2009. 

Jun Zhu and Eric P. Xing. Sparse Topical Coding. UAI, 2011. 






A Appendix: Implementing the parallelized HMC sampler 



A.1 Hamiltonian Monte Carlo conditional updates

Hamiltonian Monte Carlo (HMC) is the key tool that makes high-dimensional, non-conjugate up- 
dates tractable for our Gibbs sampler. It works well for log densities that are unimodal and have 
relatively constant curvature. We outline our customized implementation of the algorithm here; a 
general introduction can be found in Neal (2011).

HMC is a version of the Metropolis-Hastings algorithm that replaces the common Multivariate 
Normal proposal distribution with a distribution based on Hamiltonian dynamics. It can be used 
to make joint proposals on the entire parameter space or, as in this paper, to make proposals along 
the conditional posteriors as part of a Gibbs scan. While it requires closed form calculation of 
the posterior gradient and curvature to perform well, the algorithm can produce uncorrelated or 
negatively correlated draws from the target distribution that are almost always accepted. 

A consequence of classical mechanics, Hamilton's equations can be used to model the movement
of a particle along a frictionless surface. The total energy of the particle is the sum of its
potential energy (the height of the surface relative to the minimum at the current position) and its 
kinetic energy (the amount of work needed to accelerate the particle from rest to its current veloc- 
ity). Since energy is preserved in a closed system, the particle can only convert potential energy to 
kinetic (or vice versa) as it moves along the surface. 

Imagine a ball placed high on the side of the parabola $f(q) = q^2$ at position $q = -2$. Starting
out, it will have no kinetic energy but significant potential energy due to its position. As it rolls 
down the parabola toward zero, it speeds up (gaining kinetic energy), but loses potential energy 
to compensate as it moves to a lower position. At the bottom of the parabola the ball has only 
kinetic energy, which it then translates back into potential energy by rolling up the other side until 
its kinetic energy is exhausted. It will then roll back down the side it just climbed, completely 
reversing its trajectory until it returns to its original position. 
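
The ball-on-a-parabola picture can be checked numerically. The sketch below integrates Hamilton's equations for $U(q) = q^2$ with the leapfrog scheme that is standard in HMC implementations; unit mass, the step size, and the number of steps are hypothetical choices made only for this illustration.

```python
def leapfrog(q, p, grad_u, step, n_steps, mass=1.0):
    """Leapfrog integration of Hamilton's equations for a single coordinate."""
    p -= 0.5 * step * grad_u(q)            # half step for momentum
    for i in range(n_steps):
        q += step * p / mass               # full step for position
        if i < n_steps - 1:
            p -= step * grad_u(q)          # full step for momentum
    p -= 0.5 * step * grad_u(q)            # final half step for momentum
    return q, p

def energy(q, p, mass=1.0):
    """Total energy: potential U(q) = q^2 plus kinetic energy p^2 / (2 * mass)."""
    return q * q + 0.5 * p * p / mass

grad_u = lambda q: 2.0 * q                 # derivative of U(q) = q^2
q, p = -2.0, 0.0                           # ball released from rest at q = -2
trajectory = [(q, p)]
for _ in range(500):                       # 5 time units, slightly more than one period
    q, p = leapfrog(q, p, grad_u, step=0.01, n_steps=1)
    trajectory.append((q, p))
energies = [energy(qi, pi) for qi, pi in trajectory]
```

Along the simulated path the total energy stays essentially constant at its initial value of 4, the ball climbs to roughly $q = 2$ on the far side, and it then returns close to its starting point, as described above.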

HMC uses Hamiltonian dynamics as a method to find a distant point in the parameter space
with high probability of acceptance. Suppose we want to produce samples from $f(q)$, a possibly
unnormalized density. Since we want high probability regions to have the least potential energy,
we parameterize the surface the particle moves along as $U(q) = -\log f(q)$, which is the height
of the surface and the potential energy of the particle at any position $q$. The total energy of the
particle, $H(p, q)$, is the sum of its kinetic energy, $K(p)$, and its potential energy, $U(q)$, where
$p$ is its momentum along each coordinate. After drawing an initial momentum for the particle
(typically chosen as $p \sim \mathcal{N}(0, M)$, where $M$ is called the mass matrix), we allow the system to
evolve for a period of time: not so little that there is negligible absolute movement, but not so
much that the particle has time to roll back to where it started.
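
A minimal sketch of one such transition for a scalar parameter is given below. It is a generic HMC step with a leapfrog integrator and a Metropolis correction, not the customized sampler used for HPC; the target density, step size, trajectory length, and function names are hypothetical.

```python
import math
import random

def hmc_step(q, log_f, grad_log_f, mass, step, n_steps, rng):
    """One HMC transition for a scalar parameter targeting the density f(q)."""
    p = rng.gauss(0.0, math.sqrt(mass))                 # momentum draw p ~ N(0, M)
    h_old = -log_f(q) + 0.5 * p * p / mass              # H(p, q) = U(q) + K(p)
    q_new, p_new = q, p
    p_new += 0.5 * step * grad_log_f(q_new)             # dp/dt = -dU/dq = d log f / dq
    for i in range(n_steps):
        q_new += step * p_new / mass
        if i < n_steps - 1:
            p_new += step * grad_log_f(q_new)
    p_new += 0.5 * step * grad_log_f(q_new)
    h_new = -log_f(q_new) + 0.5 * p_new * p_new / mass
    # Metropolis correction for the discretization error of the integrator.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return q, False

log_f = lambda q: -0.5 * ((q - 3.0) / 0.5) ** 2         # unnormalized N(3, 0.5^2) density
grad_log_f = lambda q: -(q - 3.0) / 0.25
mass = 4.0                                              # negative second derivative of log f

rng = random.Random(42)
q, draws, accepted = 0.0, [], 0
for _ in range(6000):
    q, ok = hmc_step(q, log_f, grad_log_f, mass, step=0.3, n_steps=5, rng=rng)
    draws.append(q)
    accepted += ok
kept = draws[1000:]                                     # discard burn-in
mean = sum(kept) / len(kept)
sd = math.sqrt(sum((x - mean) ** 2 for x in kept) / len(kept))
accept_rate = accepted / len(draws)
```

With the trajectory length set near a quarter of the oscillation period, successive draws are close to independent and nearly every proposal is accepted, which is the behavior the text attributes to a well-tuned sampler.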

HMC will not generate good proposals if the particle is not given enough momentum in each
direction to efficiently explore the parameter space in a fixed window of time. The higher the
curvature of the surface, the more energy the particle needs to move to a distant point. Therefore
the performance of the algorithm depends on having a good estimate of the posterior curvature
$H(q)$ and drawing $p \sim \mathcal{N}(0, -H(q))$. If the estimated curvature is accurate and relatively
constant across the parameter space, the particle will have high initial momentum along directions
where the posterior is concentrated and less along those where the posterior is more diffuse.
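
To see why this choice of mass matrix helps, consider the idealized case (an assumption made here purely for illustration) of a Gaussian target with mean $m$ and constant negative Hessian $\Lambda = -H$. Hamilton's equations then give:

```latex
\begin{align*}
U(q) &= \tfrac{1}{2}(q - m)^\top \Lambda (q - m), &
K(p) &= \tfrac{1}{2}\, p^\top M^{-1} p, \\
\frac{dq}{dt} &= \frac{\partial K}{\partial p} = M^{-1} p, &
\frac{dp}{dt} &= -\frac{\partial U}{\partial q} = -\Lambda (q - m), \\
\frac{d^2 q}{dt^2} &= -M^{-1} \Lambda (q - m) = -(q - m)
\quad \text{when } M = \Lambda .
\end{align*}
```

With $M = \Lambda$, every coordinate oscillates about $m$ with unit angular frequency, so a single step size and trajectory length suit both tightly and loosely constrained directions. When the curvature varies across the parameter space this holds only approximately.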

Unless the (conditional) posterior is very well behaved, the Hessian should be calculated at the 
log-posterior mode to ensure positive definiteness. Maximization is generally an expensive opera- 
tion, however, so it is not feasible to update the Hessian every iteration of the sampler. In contrast, 
the log-prior curvature is very easy to calculate and well behaved everywhere. This led us to de- 
velop the scheduled conditional HMC sampler (SCHMC), an algorithm for nonconjugate Gibbs 
draws that updates the log-prior curvature at every iteration but only updates the log-likelihood 
curvature in a strategically chosen subset of iterations. We use this algorithm for all non-conjugate 
conditional draws in our Gibbs sampler. 

Specifically, suppose we want to draw from the conditional distribution 
$p(\theta \mid \psi_t, y) \propto p(y \mid \theta, \psi_t)\, p(\theta \mid \psi_t)$ 
in each Gibbs scan, where $\psi_t$ is a vector of the remaining parameters and $y$ is the observed data. 
Let $S$ be the set of full Gibbs scans in which the log-likelihood Hessian information is updated 
(which always includes the first). For Gibbs scan $i \in S$, we first calculate the conditional 
posterior mode and evaluate both the Hessian of the log-likelihood, $\log p(y \mid \theta, \psi_t)$, 
and of the log-prior, $\log p(\theta \mid \psi_t)$, at that mode, adding them together to get the 
log-posterior Hessian. We then get a conditional posterior draw with HMC using the negative Hessian 
as our mass matrix. For Gibbs scan $i \notin S$, we evaluate the log-prior Hessian at the current 
location and add it to our last evaluation of the log-likelihood Hessian to get the log-posterior 
Hessian. We then proceed as before. The SCHMC procedure is described in step-by-step detail in 
Algorithm 1. 
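
The scheduling logic can be sketched as a small cache around the likelihood Hessian. Everything named here (`ScheduledCurvature`, its arguments, zero-based scan numbering, and the conjugate Gaussian toy model) is a hypothetical stand-in chosen so that the result is easy to check; it is not the implementation used for HPC.

```python
import numpy as np

class ScheduledCurvature:
    """Refresh the log-likelihood Hessian only on scheduled Gibbs scans."""

    def __init__(self, schedule, find_mode, hess_loglik, hess_logprior):
        self.schedule = set(schedule) | {0}   # the first scan is always scheduled
        self.find_mode = find_mode
        self.hess_loglik = hess_loglik
        self.hess_logprior = hess_logprior
        self.lik_hess = None
        self.n_refreshes = 0

    def mass_matrix(self, scan, theta_current):
        if scan in self.schedule:
            # Expensive path: maximize, then evaluate both Hessians at the mode.
            point = self.find_mode()
            self.lik_hess = self.hess_loglik(point)
            self.n_refreshes += 1
        else:
            # Cheap path: keep the stored likelihood curvature, refresh only the prior's.
            point = theta_current
        return -(self.lik_hess + self.hess_logprior(point))

# Toy stand-in (hypothetical): y_j ~ N(theta, I) with prior theta ~ N(0, I / prec).
y = np.array([[1.0, 2.0], [0.5, 1.5], [1.5, 2.5]])
n, prec = len(y), 2.0
curv = ScheduledCurvature(
    schedule={5},
    find_mode=lambda: y.sum(axis=0) / (n + prec),
    hess_loglik=lambda theta: -n * np.eye(2),
    hess_logprior=lambda theta: -prec * np.eye(2),
)
theta = np.zeros(2)
masses = [curv.mass_matrix(scan, theta) for scan in range(10)]
```

Over ten scans the mode search and likelihood Hessian are computed only twice (scans 0 and 5), while the prior curvature is re-evaluated on every scan.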

A.2 SCHMC implementation details for HPC model 

In the previous section we described our general procedure for obtaining samples from unnormal- 
ized conditional posteriors, the SCHMC algorithm. In this section, we provide the gradient and 
Hessian calculations necessary to implement this procedure for the unnormalized conditional den- 
sities in the HPC model, as well as strategies to obtain the maximum of each conditional posterior. 

A.2.1 Conditional posterior of the rate parameters 

The log conditional posterior of the rate parameters for one word is: 

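\[
\begin{aligned}
\log p(\mu_f \mid W, I, l, \{\tau_f^2\}_{f=1}^V, \psi, \gamma^2, \nu, \sigma^2, T)
&= \sum_{d=1}^D \log \mathrm{Pois}(w_{fd} \mid l_d \theta_d^T \beta_f)
 + \log \mathcal{N}(\mu_f \mid \psi \mathbf{1}, \Lambda(\gamma^2, \tau_f^2, T)) \\
&= -\sum_{d=1}^D l_d \theta_d^T \beta_f + \sum_{d=1}^D w_{fd} \log(\theta_d^T \beta_f)
 - \frac{1}{2} (\mu_f - \psi \mathbf{1})^T \Lambda (\mu_f - \psi \mathbf{1}).
\end{aligned}
\]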
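Algorithm 1: Scheduled conditional HMC sampler for iteration i 

input : $\theta_{t-1}$, $\psi_t$ (current value of other parameters), $y$ (observed data), 
        $L$ (number of leapfrog steps), $\epsilon$ (stepsize), and $S$ (set of full Gibbs scans 
        in which the likelihood Hessian is updated) 
output: $\theta_t$ 

/* Update conditional likelihood Hessian if iteration in schedule */ 
if $i \in S$ then 
    $\hat{\theta} \leftarrow \arg\max_\theta \{\log p(y \mid \theta, \psi_t) + \log p(\theta \mid \psi_t)\}$; 
    $H_l(\theta) \leftarrow \frac{\partial^2}{\partial\theta\,\partial\theta^T} [\log p(y \mid \theta, \psi_t)] \big|_{\theta = \hat{\theta}}$; 
end 

/* Calculate prior Hessian and set up mass matrix */ 
$H(\theta) \leftarrow H_l(\theta) + H_p(\theta)$; 
$M \leftarrow -H(\theta)$; 

/* Draw initial momentum */ 
Draw $p^*_0 \sim \mathcal{N}(0, M)$; 

/* Leapfrog steps to get HMC proposal */ 
for $l \leftarrow 1$ to $L$ do 
    $g_1 \leftarrow -\frac{\partial}{\partial\theta} [\log p(\theta \mid \psi_t, y)] \big|_{\theta = \theta^*_{l-1}}$; 
    $p^*_{l,1} \leftarrow p^*_{l-1} - \frac{\epsilon}{2} g_1$; 
    $\theta^*_l \leftarrow \theta^*_{l-1} + \epsilon M^{-1} p^*_{l,1}$; 
    $g_2 \leftarrow -\frac{\partial}{\partial\theta} [\log p(\theta \mid \psi_t, y)] \big|_{\theta = \theta^*_l}$; 
    $p^*_l \leftarrow p^*_{l,1} - \frac{\epsilon}{2} g_2$; 
end 

/* Calculate Hamiltonian (total energy) of initial position */ 
$K_{t-1} \leftarrow \frac{1}{2} (p^*_0)^T M^{-1} p^*_0$; 
$U_{t-1} \leftarrow -\log p(\theta_{t-1} \mid \psi_t, y)$; 
$H_{t-1} \leftarrow K_{t-1} + U_{t-1}$; 

/* Calculate Hamiltonian (total energy) of candidate position */ 
$K^* \leftarrow \frac{1}{2} (p^*_L)^T M^{-1} p^*_L$; 
$U^* \leftarrow -\log p(\theta^*_L \mid \psi_t, y)$; 
$H^* \leftarrow K^* + U^*$; 

/* Metropolis correction to determine if proposal accepted */ 
Draw $u \sim \mathrm{Unif}[0, 1]$; 
$\log r \leftarrow H_{t-1} - H^*$; 
if $\log u < \log r$ then 
    $\theta_t \leftarrow \theta^*_L$ 
else 
    $\theta_t \leftarrow \theta_{t-1}$ 
end 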
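
For readers who prefer code to pseudocode, the leapfrog and Metropolis steps of Algorithm 1 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `hmc_draw`, the Gaussian toy target with precision matrix `A`, and the tuning values `eps=0.3` and `L=5` are all assumptions made for the example, and the mass matrix is simply passed in instead of being built from the scheduled Hessians.

```python
import numpy as np

def hmc_draw(theta, log_post, grad_log_post, M, eps, L, rng):
    """One HMC transition with mass matrix M, following the steps of Algorithm 1."""
    M_inv = np.linalg.inv(M)
    p0 = np.linalg.cholesky(M) @ rng.standard_normal(theta.size)   # p ~ N(0, M)
    q, p = theta.copy(), p0.copy()
    for _ in range(L):
        p = p + 0.5 * eps * grad_log_post(q)   # half step; -grad U equals grad log p
        q = q + eps * (M_inv @ p)              # full step for the position
        p = p + 0.5 * eps * grad_log_post(q)   # second half step for the momentum
    h_old = 0.5 * p0 @ M_inv @ p0 - log_post(theta)   # total energy at the start
    h_new = 0.5 * p @ M_inv @ p - log_post(q)         # total energy at the proposal
    accept = np.log(rng.uniform()) < h_old - h_new    # Metropolis correction
    return (q, True) if accept else (theta, False)

# Toy target (assumed): zero-mean Gaussian with precision A, so -Hessian = A everywhere.
A = np.diag([50.0, 0.5])
log_post = lambda th: -0.5 * th @ A @ th
grad_log_post = lambda th: -A @ th

rng = np.random.default_rng(1)
theta, draws, n_accept = np.zeros(2), [], 0
for _ in range(4000):
    theta, accepted = hmc_draw(theta, log_post, grad_log_post, A, eps=0.3, L=5, rng=rng)
    draws.append(theta)
    n_accept += accepted
draws = np.array(draws)
accept_rate = n_accept / 4000
sample_var = draws.var(axis=0)   # should be near the diagonal of inv(A): 0.02 and 2.0
```

With the mass matrix equal to the negative Hessian, a single step size works for both coordinates even though their posterior scales differ by a factor of 100 in variance.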
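Since the likelihood is a function of $\beta_f$, we need to use the chain rule to get the gradient in 
$\mu_f$ space: 

\[
\begin{aligned}
\frac{\partial}{\partial \mu_f} \log p(\mu_f \mid W, I, l, \{\tau_f^2\}_{f=1}^V, \psi, \gamma^2, T)
&= \frac{\partial l(\beta_f)}{\partial \beta_f} \circ \frac{\partial \beta_f}{\partial \mu_f}
 + \frac{\partial}{\partial \mu_f} \log \mathcal{N}(\mu_f \mid \psi \mathbf{1}, \Lambda) \\
&= \left[ -\sum_{d=1}^D l_d \theta_d + \sum_{d=1}^D \frac{w_{fd}}{\theta_d^T \beta_f} \theta_d \right] \circ \beta_f
 - \Lambda (\mu_f - \psi \mathbf{1}),
\end{aligned}
\]

where $l(\beta_f)$ is the Poisson log-likelihood of the word counts as a function of $\beta_f$ and 
$\circ$ is the Hadamard (entrywise) product. The Hessian matrix follows a similar pattern: 

\[
H(\log p(\mu_f \mid W, I, l, \{\tau_f^2\}_{f=1}^V, \psi, \gamma^2, \{\theta_d\}_{d=1}^D, T))
= -\Theta^T W \Theta \circ \beta_f \beta_f^T + G - \Lambda,
\]

where $\Theta$ is the $D \times K$ matrix with rows $\theta_d^T$, 

\[
W = \mathrm{diag}\left( \left\{ \frac{w_{fd}}{(\theta_d^T \beta_f)^2} \right\}_{d=1}^D \right)
\]

and 

\[
G = \mathrm{diag}\left( \frac{\partial l(\beta_f)}{\partial \beta_f} \circ \beta_f \right).
\]

We use the BFGS algorithm with the analytical gradient derived above to maximize this density 
for iterations where the likelihood Hessian is updated; this quasi-Newton method works well since 
the conditional posterior is unimodal. The Hessian of the likelihood in $\beta$ space is clearly 
negative definite everywhere since $\Theta^T W \Theta$ is a positive definite matrix. The prior 
Hessian $\Lambda$ is also positive definite by definition since it is the precision matrix of a 
Gaussian variate. However, the contribution of the chain rule term $G$ can cause the Hessian to 
become indefinite away from the mode in $\mu$ space if any of the gradient entries are sufficiently 
large and positive. Note, however, that the conditional posterior is still unimodal since the 
logarithm is a monotone transformation. 
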
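
The chain-rule structure of this gradient and Hessian can be checked numerically. The sketch below is an assumption-laden illustration: it takes the Poisson rate to be $l_d \theta_d^T \beta_f$ with $\beta_f = \exp(\mu_f)$, a Gaussian prior with precision $\Lambda$ and mean $\psi \mathbf{1}$, and simulated data whose sizes and values are invented. The function and variable names are hypothetical.

```python
import numpy as np

def log_post(mu, w, ell, Theta, Lam, psi):
    beta = np.exp(mu)
    rate = Theta @ beta                      # theta_d^T beta_f for each document d
    resid = mu - psi
    return -(ell * rate).sum() + (w * np.log(rate)).sum() - 0.5 * resid @ Lam @ resid

def grad_log_post(mu, w, ell, Theta, Lam, psi):
    beta = np.exp(mu)
    rate = Theta @ beta
    grad_beta = Theta.T @ (w / rate - ell)   # likelihood gradient in beta space
    return grad_beta * beta - Lam @ (mu - psi)

def hess_log_post(mu, w, ell, Theta, Lam, psi):
    beta = np.exp(mu)
    rate = Theta @ beta
    grad_beta = Theta.T @ (w / rate - ell)
    W = np.diag(w / rate**2)
    G = np.diag(grad_beta * beta)            # extra chain-rule term on the diagonal
    return -(Theta.T @ W @ Theta) * np.outer(beta, beta) + G - Lam

def central_diff(fun, x, h=1e-5):
    """Central differences of fun along each coordinate, stacked as columns."""
    cols = []
    for k in range(x.size):
        e = np.zeros(x.size)
        e[k] = h
        cols.append((fun(x + e) - fun(x - e)) / (2 * h))
    return np.array(cols).T

rng = np.random.default_rng(0)
D, K = 40, 3
Theta = rng.dirichlet(np.ones(K), size=D)        # rows are theta_d^T
ell = rng.integers(50, 200, size=D)              # document lengths l_d
mu = np.log(np.array([0.02, 0.005, 0.05]))       # log rates, beta_f = exp(mu_f)
w = rng.poisson(ell * (Theta @ np.exp(mu)))      # simulated counts for one word
Lam, psi = 2.0 * np.eye(K), -4.0
args = (w, ell, Theta, Lam, psi)

grad_fd = central_diff(lambda m: log_post(m, *args), mu)
hess_fd = central_diff(lambda m: grad_log_post(m, *args), mu)
grad_an = grad_log_post(mu, *args)
hess_an = hess_log_post(mu, *args)
```

Agreement between the analytic and finite-difference quantities is a cheap sanity check worth running before handing the gradient to BFGS or the Hessian to the sampler.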
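A.2.2 Conditional posterior of the topic affinity parameters 

The log conditional posterior for the topic affinity parameters for one document is: 

\[
\begin{aligned}
\log p(\xi_d \mid W, I, l, \{\mu_f, \tau_f^2\}_{f=1}^V, \eta, \Sigma)
&= \sum_{f=1}^V \log \mathrm{Pois}(w_{fd} \mid l_d \beta_f^T \theta_d)
 + \log \mathrm{Bernoulli}(I_d \mid \xi_d) + \log \mathcal{N}(\xi_d \mid \eta, \Sigma) \\
&= -l_d \sum_{f=1}^V \beta_f^T \theta_d + \sum_{f=1}^V w_{fd} \log(\beta_f^T \theta_d)
 - \sum_{k=1}^K \log(1 + \exp(-\xi_{dk})) \\
&\quad - \sum_{k=1}^K (1 - I_{dk}) \xi_{dk} - \frac{1}{2} (\xi_d - \eta)^T \Sigma^{-1} (\xi_d - \eta).
\end{aligned}
\]
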

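Since the likelihood of the word counts is a function of $\theta_d$, we need to use the chain rule to 
get the gradient of the likelihood in $\xi_d$ space. This mapping is more complicated than in the 
case of the $\mu_f$ parameters since each $\theta_{dk}$ is a function of all elements of $\xi_d$: 

\[
\nabla l_d(\xi_d) = \nabla l_d(\theta_d)^T J(\theta_d \to \xi_d),
\]

where $J(\theta_d \to \xi_d)$ is the Jacobian of the transformation from $\theta$ space to $\xi$ 
space, a $K \times K$ symmetric matrix. Let $S = \sum_{j=1}^K \exp \xi_{dj}$. Then 

\[
J(\theta_d \to \xi_d) = S^{-2}
\begin{bmatrix}
S \exp \xi_{d1} - \exp 2\xi_{d1} & \cdots & -\exp(\xi_{d1} + \xi_{dK}) \\
-\exp(\xi_{d1} + \xi_{d2}) & \cdots & -\exp(\xi_{dK} + \xi_{d2}) \\
\vdots & \ddots & \vdots \\
-\exp(\xi_{d1} + \xi_{dK}) & \cdots & S \exp \xi_{dK} - \exp 2\xi_{dK}
\end{bmatrix}.
\]

The gradient of the likelihood of the word counts in terms of $\theta_d$ is 

\[
\nabla l_d(\theta_d) = -l_d \sum_{f=1}^V \beta_f + \sum_{f=1}^V \frac{w_{fd}}{\beta_f^T \theta_d} \beta_f.
\]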
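
The Jacobian can be verified numerically. The sketch below assumes that $\theta_d$ is the plain softmax of $\xi_d$ over all $K$ topics, as the definition of $S$ suggests; the values of `xi` are arbitrary, and the compact form $\mathrm{diag}(\theta) - \theta\theta^T$ is a standard identity rather than something stated in the text.

```python
import numpy as np

def softmax(xi):
    e = np.exp(xi - xi.max())      # shift by the maximum for numerical stability
    return e / e.sum()

def jacobian_explicit(xi):
    """J[j, k] = d theta_j / d xi_k, written with S = sum_j exp(xi_j)."""
    e = np.exp(xi)
    S = e.sum()
    return (S * np.diag(e) - np.outer(e, e)) / S**2

xi = np.array([0.3, -1.2, 2.0, 0.5])
theta = softmax(xi)
J = jacobian_explicit(xi)
J_compact = np.diag(theta) - np.outer(theta, theta)   # equivalent compact form

# Central finite differences, one column per coordinate of xi.
h = 1e-6
J_fd = np.column_stack([
    (softmax(xi + h * np.eye(4)[k]) - softmax(xi - h * np.eye(4)[k])) / (2 * h)
    for k in range(4)
])
```

Each column of J sums to zero because the entries of $\theta_d$ always sum to one, which is a quick way to catch indexing mistakes.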



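Finally, to get the gradient of the full conditional posterior, we add the gradient of the likelihood 
of the labels and of the normal prior on the $\xi_d$: 

\[
\frac{\partial}{\partial \xi_d} \log p(\xi_d \mid W, I, l, \{\mu_f\}_{f=1}^V, \eta, \Sigma)
= \nabla l_d(\theta_d)^T J(\theta_d \to \xi_d) + (1 + \exp \xi_d)^{-1} - (1 - I_d) - \Sigma^{-1} (\xi_d - \eta),
\]

with the exponential and the inverse in the second term applied entrywise. 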
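
As a hedged check of this gradient, the sketch below codes one version of the log conditional posterior and compares the analytic gradient with central finite differences. It assumes a softmax map from $\xi_d$ to $\theta_d$, a logistic label likelihood, and a Gaussian prior, which is the combination consistent with the gradient terms above; all data, dimensions, and names (`B`, `labels`, `Sigma_inv`) are invented for the example.

```python
import numpy as np

def softmax(xi):
    e = np.exp(xi - xi.max())
    return e / e.sum()

def log_post(xi, w, ell, B, labels, eta, Sigma_inv):
    theta = softmax(xi)
    rate = B @ theta                                        # beta_f^T theta_d per word f
    resid = xi - eta
    return (-ell * rate.sum() + w @ np.log(rate)            # Poisson word counts
            - np.log1p(np.exp(-xi)).sum() - (1 - labels) @ xi   # Bernoulli labels
            - 0.5 * resid @ Sigma_inv @ resid)              # normal prior

def grad_log_post(xi, w, ell, B, labels, eta, Sigma_inv):
    theta = softmax(xi)
    rate = B @ theta
    grad_theta = B.T @ (w / rate - ell)                     # gradient in theta space
    J = np.diag(theta) - np.outer(theta, theta)             # softmax Jacobian (symmetric)
    return (J @ grad_theta + 1.0 / (1.0 + np.exp(xi)) - (1 - labels)
            - Sigma_inv @ (xi - eta))

rng = np.random.default_rng(3)
V, K = 30, 4
B = rng.gamma(2.0, 0.01, size=(V, K))        # rows are beta_f (hypothetical rates)
xi = rng.normal(size=K)
ell = 150.0                                  # document length l_d
w = rng.poisson(ell * (B @ softmax(xi)))     # simulated word counts for one document
labels = np.array([1.0, 0.0, 1.0, 1.0])      # label indicators I_d
eta, Sigma_inv = np.zeros(K), np.eye(K) + 0.5

h = 1e-5
grad_fd = np.array([
    (log_post(xi + h * e, w, ell, B, labels, eta, Sigma_inv)
     - log_post(xi - h * e, w, ell, B, labels, eta, Sigma_inv)) / (2 * h)
    for e in np.eye(K)
])
grad_an = grad_log_post(xi, w, ell, B, labels, eta, Sigma_inv)
```

Because the Jacobian is symmetric, `J @ grad_theta` gives the same vector as the row-vector product written in the formula.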



The Hessian matrix of the conditional posterior is a complicated tensor product that is not 
efficient to evaluate analytically. Instead, we compute a numerical Hessian using the analytic 
gradient presented above at minimal computational cost. 
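
One common way to build such a numerical Hessian is to difference the analytic gradient along each coordinate. The sketch below shows the idea on a toy function with a known Hessian; the function name `numerical_hessian`, the step size, and the test function are assumptions for illustration and not the HPC posterior.

```python
import numpy as np

def numerical_hessian(grad, x, h=1e-5):
    """Central-difference Hessian built from an analytic gradient function."""
    n = x.size
    H = np.empty((n, n))
    for k in range(n):
        step = np.zeros(n)
        step[k] = h
        H[:, k] = (grad(x + step) - grad(x - step)) / (2 * h)
    return 0.5 * (H + H.T)    # symmetrize away rounding asymmetry

# Toy check with a known answer:
# f(x) = -sum(exp(x)) - 0.5 * x^T A x has gradient -exp(x) - A x
# and Hessian -diag(exp(x)) - A.
A = np.array([[2.0, 0.3],
              [0.3, 1.0]])
grad = lambda x: -np.exp(x) - A @ x
x0 = np.array([0.2, -0.7])

H_num = numerical_hessian(grad, x0)
H_exact = -np.diag(np.exp(x0)) - A
```

The cost is 2K gradient evaluations for a K-dimensional parameter, which is modest when K is the number of topics.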
We use the BFGS algorithm with the analytical gradient derived above to maximize this density 
for iterations where the likelihood Hessian is updated. We have not been able to show analytically 
that this conditional posterior is unimodal, but we have verified this graphically for several 
documents and have achieved very high acceptance rates for our HMC proposals based on 
this Hessian calculation. 



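A.2.3 Conditional posterior of the $\tau^2_{fk}$ hyperparameters 

The variance parameters $\tau^2_{fk}$ independently follow an identical Scaled Inverse-$\chi^2$ 
distribution with convolution parameter $\nu$ and scale parameter $\sigma^2$, while their inverse 
follows a Gamma($\kappa_\tau = \nu/2$, $\lambda_\tau = 2/(\nu\sigma^2)$) distribution. The log 
conditional posterior of these parameters is: 

\[
\begin{aligned}
\log p(\kappa_\tau, \lambda_\tau \mid \{\tau_f^2\}_{f=1}^V, T)
&= (\kappa_\tau - 1) \sum_{f=1}^V \sum_{k \in \mathcal{P}} \log (\tau_{fk}^2)^{-1}
 - |\mathcal{V}||\mathcal{P}| \kappa_\tau \log \lambda_\tau \\
&\quad - |\mathcal{V}||\mathcal{P}| \log \Gamma(\kappa_\tau)
 - \frac{1}{\lambda_\tau} \sum_{f=1}^V \sum_{k \in \mathcal{P}} (\tau_{fk}^2)^{-1},
\end{aligned}
\]

where $\mathcal{P}(T)$ is the set of parent topics on the tree. If we allow 
$i \in \{1, \dots, N = |\mathcal{V}||\mathcal{P}|\}$ to index all the $f, k$ pairs and 
$l(\kappa_\tau, \lambda_\tau) = \log p(\{\tau_f^2\}_{f=1}^V \mid \kappa_\tau, \lambda_\tau, T)$, 
we can simplify this to 

\[
l(\kappa_\tau, \lambda_\tau) = (\kappa_\tau - 1) \sum_{i=1}^N \log \tau_i^{-2}
 - N \kappa_\tau \log \lambda_\tau - N \log \Gamma(\kappa_\tau)
 - \frac{1}{\lambda_\tau} \sum_{i=1}^N \tau_i^{-2}.
\]

We then transform this density onto the $(\log \kappa_\tau, \log \lambda_\tau)$ scale so that the 
parameters are unconstrained, a requirement for standard HMC implementation. Each draw of 
$(\log \kappa_\tau, \log \lambda_\tau)$ is then transformed back to the $(\nu, \sigma^2)$ scale. To 
get the Hessian of the likelihood in log space, we calculate the derivatives of the likelihood in 
the original space and apply the chain rule: 

\[
H(l(\log \kappa_\tau, \log \lambda_\tau)) =
\begin{bmatrix}
\kappa_\tau \frac{\partial l(\kappa_\tau, \lambda_\tau)}{\partial \kappa_\tau}
 + (\kappa_\tau)^2 \frac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial (\kappa_\tau)^2}
& \kappa_\tau \lambda_\tau \frac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial \kappa_\tau \partial \lambda_\tau} \\
\kappa_\tau \lambda_\tau \frac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial \kappa_\tau \partial \lambda_\tau}
& \lambda_\tau \frac{\partial l(\kappa_\tau, \lambda_\tau)}{\partial \lambda_\tau}
 + (\lambda_\tau)^2 \frac{\partial^2 l(\kappa_\tau, \lambda_\tau)}{\partial (\lambda_\tau)^2}
\end{bmatrix},
\]

where 

\[
\nabla l(\kappa_\tau, \lambda_\tau) =
\begin{bmatrix}
\sum_{i=1}^N \log \tau_i^{-2} - N \log \lambda_\tau - N \psi(\kappa_\tau) \\
-\frac{N \kappa_\tau}{\lambda_\tau} + \frac{1}{(\lambda_\tau)^2} \sum_{i=1}^N \tau_i^{-2}
\end{bmatrix}
\]

and 

\[
H(l(\kappa_\tau, \lambda_\tau)) =
\begin{bmatrix}
-N \psi'(\kappa_\tau) & -\frac{N}{\lambda_\tau} \\
-\frac{N}{\lambda_\tau} & \frac{N \kappa_\tau}{(\lambda_\tau)^2} - \frac{2}{(\lambda_\tau)^3} \sum_{i=1}^N \tau_i^{-2}
\end{bmatrix}.
\]

Here $\psi(\cdot)$ and $\psi'(\cdot)$ denote the digamma and trigamma functions. 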
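
The log-scale Hessian can be checked against finite differences of the log-likelihood itself. The sketch below is illustrative only: it treats the precisions $\tau_i^{-2}$ as Gamma draws with shape $\kappa_\tau$ and scale $\lambda_\tau$, uses invented data, and implements the digamma and trigamma functions with a short recurrence plus asymptotic series (an assumption made to keep the example dependency-free; a library routine would normally be used).

```python
import math

def digamma(x):
    """Digamma via upward recurrence and an asymptotic series (approximate)."""
    total = 0.0
    while x < 6.0:
        total -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return total + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """Trigamma via upward recurrence and an asymptotic series (approximate)."""
    total = 0.0
    while x < 6.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv, inv2 = 1.0 / x, 1.0 / (x * x)
    return total + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def loglik(kappa, lam, prec):
    """Gamma(shape kappa, scale lam) log-likelihood of the precisions tau_i^{-2}."""
    n = len(prec)
    return ((kappa - 1) * sum(math.log(v) for v in prec) - n * kappa * math.log(lam)
            - n * math.lgamma(kappa) - sum(prec) / lam)

def grad(kappa, lam, prec):
    n = len(prec)
    return (sum(math.log(v) for v in prec) - n * math.log(lam) - n * digamma(kappa),
            -n * kappa / lam + sum(prec) / lam ** 2)

def hess(kappa, lam, prec):
    """Returns the (kappa-kappa, kappa-lambda, lambda-lambda) entries."""
    n = len(prec)
    return (-n * trigamma(kappa), -n / lam,
            n * kappa / lam ** 2 - 2 * sum(prec) / lam ** 3)

def hess_log_scale(kappa, lam, prec):
    """Chain rule: Hessian with respect to (log kappa, log lambda)."""
    g_k, g_l = grad(kappa, lam, prec)
    h_kk, h_kl, h_ll = hess(kappa, lam, prec)
    return (kappa * g_k + kappa ** 2 * h_kk,
            kappa * lam * h_kl,
            lam * g_l + lam ** 2 * h_ll)

prec = [0.8, 1.5, 2.2, 0.4, 1.1, 3.0]     # made-up precisions
kappa, lam = 2.0, 0.7
a, b, h = math.log(kappa), math.log(lam), 1e-4
f = lambda u, v: loglik(math.exp(u), math.exp(v), prec)

# Second-order central differences taken directly in log space.
fd_kk = (f(a + h, b) - 2 * f(a, b) + f(a - h, b)) / h ** 2
fd_ll = (f(a, b + h) - 2 * f(a, b) + f(a, b - h)) / h ** 2
fd_kl = (f(a + h, b + h) - f(a + h, b - h) - f(a - h, b + h) + f(a - h, b - h)) / (4 * h ** 2)
an_kk, an_kl, an_ll = hess_log_scale(kappa, lam, prec)
```

At the mode the first-derivative terms vanish, so the log-scale Hessian there reduces to the original-scale Hessian rescaled by the parameter values.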
Following Algorithm 1, we evaluate the Hessian at the mode of this joint posterior. This is 
easiest to find on the original scale following the properties of the Gamma distribution. The first 
order condition for $\lambda_\tau$ can be solved analytically: 

\[
\lambda_{\tau,\mathrm{mle}}(\kappa_\tau) = \arg\max_{\lambda_\tau} \{ l(\kappa_\tau, \lambda_\tau) \}
= \frac{1}{\kappa_\tau N} \sum_{i=1}^N \tau_i^{-2}.
\]

Once $\kappa_{\tau,\mathrm{mle}}$, the maximizer of the profile likelihood in $\kappa_\tau$, is 
available, the joint mode in the original space is 
$(\kappa_{\tau,\mathrm{mle}}, \lambda_{\tau,\mathrm{mle}}(\kappa_{\tau,\mathrm{mle}}))$. Due to the 
monotonicity of the logarithm function, the mode in the transformed space is simply 
$(\log \kappa_{\tau,\mathrm{mle}}, \log \lambda_{\tau,\mathrm{mle}})$. We can be confident that the 
conditional posterior is unimodal: the Fisher information for a Gamma distribution is positive 
definite, so the expected Hessian of the log-likelihood is negative definite, and the log 
transformation to the unconstrained space is monotonic. 

We obtain $\kappa_{\tau,\mathrm{mle}}$ by numerically maximizing the profile likelihood of $\kappa_\tau$: 